Genome-scale species classification of ancient and modern sequence reads

NAISS 2023/5-251


NAISS Medium Compute

Principal Investigator:

Tom van der Valk


Naturhistoriska riksmuseet

Start Date:


End Date:


Primary Classification:

10203: Bioinformatics (Computational Biology) (applications to be 10610)

Secondary Classification:

10611: Ecology



Metagenomic samples are a gold mine

Just a few grams of a metagenomic sample, such as those obtained from lake and river water, sediments, soil or faeces, can harbour the genetic information of thousands of organisms (Bohmann et al. 2014; Thomsen and Willerslev 2015). The wealth of data contained in such samples can be utilised for a wide range of applications, including large-scale biodiversity monitoring, species detection, and microbiome and dietary inferences of individuals (Taberlet et al. 2018).

Current analysis methods are limited

The research community has traditionally focused on amplifying and sequencing targeted, small stretches of DNA that are unique to a focal set of species (metabarcoding). Such methods analyse only a minute fraction of the total DNA in the sample, and the accuracy and sensitivity of metabarcoding are therefore far below what is theoretically possible. Within the last decade, sequencing costs have fallen, and high-performance computing clusters and genome assembly databases have grown, by orders of magnitude. It is now financially feasible to sequence nearly all of the DNA within a metagenomic sample and, together with the rapidly expanding databases, to leverage most of the information contained in such samples.

In this project we aim to develop a pipeline that can efficiently classify sequence reads to their species of origin, while accounting for DNA quality (short and damaged reads), contamination (including lab contamination) and technical sources of error such as genome assembly errors, GC-sequencing biases and sequencing errors. We will build on and improve existing classification tools, including ganon (Piro et al. 2020), Meta-Align (Tomii et al. 2020) and Kraken 2 (Wood et al. 2019), with the aim of extending the scope of metagenomic data analysis to the genome-wide level at high sensitivity while maintaining accuracy. We aim to make the pipeline and database build publicly available to the scientific community.

Bohmann, K. et al. (2014) 'Environmental DNA for wildlife biology and biodiversity monitoring' In: Trends in ecology & evolution 29 (6) pp.358–367.
Piro, V. C. et al. (2020) 'ganon: precise metagenomics classification against large and up-to-date sets of reference sequences' In: Bioinformatics 36 (Suppl_1) pp.i12–i20.
Taberlet, P. et al. (2018) Environmental DNA: For Biodiversity Research and Monitoring. (s.l.): Oxford University Press.
Thomsen, P. F. and Willerslev, E. (2015) 'Environmental DNA – An emerging tool in conservation for monitoring past and present biodiversity' In: Biological conservation 183 pp.4–18.
Tomii, K. et al. (2020) 'Meta-Align: A Novel HMM-based Algorithm for Pairwise Alignment of Error-Prone Sequencing Reads' In: bioRxiv At: (Accessed 26/08/2021).
Wood, D. E. et al. (2019) 'Improved metagenomic analysis with Kraken 2' In: Genome biology 20 (1) p.257.