SUPR
Identifying metabarcoding “dark matter” from Malaise trap samples
Dnr:

NAISS 2023/5-344

Type:

NAISS Medium Compute

Principal Investigator:

Nicolas Chazot

Affiliation:

Sveriges lantbruksuniversitet

Start Date:

2023-09-28

End Date:

2024-04-01

Primary Classification:

10611: Ecology

Secondary Classification:

10612: Biological Systematics

Abstract

DNA metabarcoding has enabled large-scale biodiversity studies of groups containing considerable numbers of unknown taxa, such as insects. However, it usually yields large numbers of reads that algorithms fail to assign even at the highest taxonomic ranks. These genetic data, which some authors have called the “dark matter” of environmental DNA, are mainly non-eukaryote DNA or belong to poorly known sections of the tree of life. With the proportion of unassigned reads increasing dramatically in highly diverse localities, e.g. tropical areas, solving this issue is imperative. One promising approach to improving classification is phylogenetic placement, where individual sequences are classified based on their placement in a reference tree. The Insect Biome Atlas project has generated large amounts of metabarcoding data from Malaise trap samples collected at sites in Sweden and Madagascar, and a high proportion of these data cannot be identified even approximately. Our goal is to use a phylogenetic placement algorithm to improve our taxonomic assignment and build a more accurate understanding of the biodiversity sampled at every site. We are focusing on two kinds of information. For insect-related reads, we aim to obtain a reliable family-level taxonomic assignment. For non-insect reads, we want to identify what kind of DNA accompanies these Malaise trap samples, including both eukaryote and non-eukaryote DNA. To achieve this, we will build a reference tree spanning the entire tree of life, with a higher density of reference taxa within the class Insecta. We will then run phylogenetic placement on metabarcoding data from tropical arthropod samples, amounting to tens of thousands of sequences, using EPA-NG, a maximum likelihood placement algorithm.