(Meta)genome mining approaches for novel secondary metabolite discovery

NAISS 2023/5-422


NAISS Medium Compute

Principal Investigator:

Laura Carroll


Umeå universitet

Start Date:


End Date:


Primary Classification:

10606: Microbiology (medical to be 30109 and agricultural to be 40302)




Secondary metabolites produced by microbes allow their producer organisms to interact with their environment and respond to stimuli and stressors. In human host-associated microbial communities, secondary metabolites modulate host health via a range of processes (e.g., immune system regulation, xenobiotic and nutrient metabolism, cancer susceptibility/resistance). Furthermore, many bacterial secondary metabolites have found important uses in medicine and industry, including as revolutionary antimicrobial, anticancer, and antidiabetic drugs, while others are toxic to humans. Consequently, there is an incentive to identify novel metabolites that could be leveraged in innovative therapies or applications. Here, >1 million microbial (meta)genomes will be queried for biosynthetic gene clusters (BGCs), clusters of genes that encode the enzymatic machinery responsible for microbial secondary metabolite production. The set of (meta)genomes includes: (i) ~300 thousand previously published metagenome-assembled genomes (MAGs) derived from the human gut microbiome (~5 Tb); (ii) ~35 thousand MAGs derived from ocean microbiome samples (~2 Tb); and (iii) ~6 thousand Bacillus cereus group (meta)genomes (~2 Tb, including genomes that are not yet assembled). To expand the project, we will identify BGCs in the largest number of prokaryotic genomes ever queried; specifically, we will also identify BGCs in (iv) every genome used to construct the mOTUs v3 database (700k genomes; ~13 Tb) and (v) a set of 10k previously unassembled bacterial genomes, which are currently represented by sequencing reads (~10 Tb).
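The dataset sizes quoted above can be cross-checked with a short tally. The following is a minimal sketch, assuming the approximate counts and "Tb" figures exactly as stated in the abstract and assuming no overlap between the five sets (the abstract does not say whether, for example, the mOTUs v3 genomes include any of the gut or ocean MAGs). The dictionary keys and the `tally` helper are illustrative names, not part of the project.

```python
# Approximate dataset sizes as quoted in the abstract:
# (number of (meta)genomes, storage in "Tb" as written there).
DATASETS = {
    "human gut MAGs": (300_000, 5),
    "ocean MAGs": (35_000, 2),
    "Bacillus cereus group (meta)genomes": (6_000, 2),
    "mOTUs v3 database genomes": (700_000, 13),
    "unassembled bacterial genomes": (10_000, 10),
}


def tally(datasets):
    """Sum genome counts and storage, assuming the sets do not overlap."""
    total_genomes = sum(count for count, _ in datasets.values())
    total_tb = sum(size for _, size in datasets.values())
    return total_genomes, total_tb


total_genomes, total_tb = tally(DATASETS)
print(f"{total_genomes:,} (meta)genomes, ~{total_tb} Tb")
```

Under these assumptions the naive sum is consistent with the ">1 million" figure given in the abstract; any overlap between sets would lower the number of distinct genomes.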
antiSMASH and GECCO will be used to query all (meta)genomes, and the resulting BGCs will be used to: (i) construct the largest-ever atlas of BGCs, which can be used by experimentalists to identify novel natural products in a variety of microbiomes; (ii) identify novel secondary metabolites involved in microbiome-mediated diseases, including colorectal cancer, inflammatory bowel disease, Crohn's disease, and type II diabetes; (iii) identify novel BGC-encoded antimicrobials with potential applications in human and/or animal medicine; (iv) identify novel secondary metabolites involved in host-microbe and microbe-microbe interactions in various biomes, including the human gut, human vaginal, and ocean microbiomes; and (v) improve the accuracy of current state-of-the-art machine learning approaches for de novo BGC discovery. Through this project, which will involve mining millions of microbial (meta)genomes for millions of BGCs, we hope to discover novel secondary metabolites that play critical roles in human, animal, and/or environmental health (e.g., novel toxins, carcinogens, antimicrobials).
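Querying each genome with both tools could be scripted along the following lines. This is only a sketch under stated assumptions: the abstract does not give the project's actual parameters, so the command-line options shown (`--genefinding-tool` and `--output-dir` for antiSMASH, `run --genome ... -o ...` for GECCO) are taken from each tool's documented command-line interface and should be checked against the installed versions. The `build_commands` helper, the file paths, and the output directory layout are hypothetical.

```python
from pathlib import PurePosixPath


def build_commands(genome, out_root):
    """Build one antiSMASH and one GECCO command for a single genome file.

    Hypothetical helper: paths and output layout are illustrative only,
    and the commands are returned rather than executed.
    """
    genome = PurePosixPath(genome)
    out_root = PurePosixPath(out_root)
    sample = genome.stem  # e.g. "sample1" for "genomes/sample1.fna"

    antismash_cmd = [
        "antismash",
        "--genefinding-tool", "prodigal",  # assumed: input is unannotated FASTA
        "--output-dir", str(out_root / "antismash" / sample),
        str(genome),
    ]
    gecco_cmd = [
        "gecco", "run",
        "--genome", str(genome),
        "-o", str(out_root / "gecco" / sample),
    ]
    return [antismash_cmd, gecco_cmd]


for cmd in build_commands("genomes/sample1.fna", "results"):
    print(" ".join(cmd))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` or written into a job script; keeping command construction separate from execution makes it easier to inspect what would be run before submitting a large batch.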