DNA metabarcoding has enabled large-scale biodiversity studies of groups containing considerable numbers of unknown taxa, such as insects. However, DNA metabarcoding usually yields large numbers of reads that algorithms fail to assign even at high taxonomic ranks. These reads belong to poorly known sections of the tree of life, including both novel species and species lacking reference sequences. With the proportion of unassigned reads increasing dramatically in highly diverse localities, e.g. tropical areas, solving this issue is imperative.
The Insect Biome Atlas (IBA) project has generated large amounts of metabarcoding data from Malaise trap samples collected at sites in Sweden and Madagascar, and many of the resulting sequences cannot be identified even at order level. One promising approach to improve classification is phylogenetic placement, where individual sequences are classified based on their placement in a reference tree. Our goal is to evaluate the use of phylogenetic placement to improve taxonomic classification of metabarcoding samples in comparison with other algorithms. To achieve this, we will build a reference tree spanning the entire diversity of Malaise trap samples. We will then run phylogenetic placement on both benchmark barcoding datasets and metabarcoding data from tropical arthropod samples, amounting to tens of thousands of sequences, using EPA-NG, a maximum likelihood phylogenetic placement algorithm.
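As a rough illustration of how placement output could feed a classification step, the sketch below reads results in the jplace format (the JSON format written by placement tools such as EPA-NG) and keeps, for each query sequence, the reference-tree edge with the highest likelihood weight ratio. The function name `best_placements`, the `min_lwr` threshold, and the toy input are assumptions made for illustration only; mapping an edge to a taxonomic label would require an annotated reference tree and is not shown.

```python
def best_placements(jplace, min_lwr=0.0):
    """Return {query_name: (edge_num, like_weight_ratio)} for the best placement.

    `jplace` is a parsed jplace document (e.g. the result of json.load).
    `min_lwr` is a hypothetical confidence cutoff: queries whose best
    placement falls below it are left out (i.e. remain unassigned).
    """
    fields = jplace["fields"]
    edge_i = fields.index("edge_num")
    lwr_i = fields.index("like_weight_ratio")

    best = {}
    for entry in jplace["placements"]:
        # Each row of "p" is one candidate placement of the same query.
        top = max(entry["p"], key=lambda row: row[lwr_i])
        # Query names are stored under "n" (names) or "nm" (name, multiplicity).
        names = list(entry.get("n", [])) + [nm[0] for nm in entry.get("nm", [])]
        if top[lwr_i] >= min_lwr:
            for name in names:
                best[name] = (top[edge_i], top[lwr_i])
    return best


# Toy input invented for this example; not real project data.
example = {
    "version": 3,
    "tree": "((A:0.1{0},B:0.1{1}):0.05{2},C:0.2{3}){4};",
    "fields": ["edge_num", "likelihood", "like_weight_ratio",
               "distal_length", "pendant_length"],
    "placements": [
        {"p": [[0, -1000.0, 0.9, 0.01, 0.02], [2, -1002.0, 0.1, 0.01, 0.03]],
         "n": ["asv_1"]},
        {"p": [[3, -900.0, 0.55, 0.02, 0.05], [2, -900.5, 0.45, 0.01, 0.04]],
         "nm": [["asv_2", 1]]},
    ],
}

confident = best_placements(example, min_lwr=0.8)
```

With the toy input, only `asv_1` passes the 0.8 cutoff, while `asv_2` is spread across two edges and would stay unassigned at that threshold. In practice, a placement that is uncertain between neighbouring edges may still support an assignment at a higher taxonomic rank, which this sketch does not attempt.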
In addition, the massive amount of DNA information provided by metabarcoding data is a fantastic opportunity for building the insect tree of life. However, DNA barcodes contain little phylogenetic information and are therefore generally insufficient to infer deep evolutionary relationships, and the large volume of data imposes strong computational limitations. Within this project, we also aim to build upon both the reference trees described above and the metabarcoding data to evaluate the performance of different phylogenetic reconstruction algorithms for generating very large phylogenies of barcodes.