SUPR
Arthropod Population Metagenomics
Dnr:

NAISS 2025/22-1025

Type:

NAISS Small Compute

Principal Investigator:

Samantha López Clinton

Affiliation:

Naturhistoriska riksmuseet

Start Date:

2025-07-31

End Date:

2026-08-01

Primary Classification:

10203: Bioinformatics (Computational Biology) (Applications at 10610)

Abstract

I am a doctoral student at the Department of Bioinformatics and Genetics at the Swedish Museum of Natural History, the Centre for Palaeogenetics, and Stockholm University. My research uses metagenomic data to explore arthropod population genomics, with a focus on high-throughput shotgun sequencing datasets from both modern and ancient environments. The work spans four main projects, all of which require substantial computational resources for taxonomic classification, reference-based mapping, and population genetic inference.

The first two projects analyse ~100 metagenomic samples. I have developed a custom 1.1 TB arthropod genome reference database to classify reads and recover genome-wide data from complex, mixed-species samples. Downstream analyses include mitochondrial haplotype calling, estimation of heterozygosity, PCA-based geographic assignment, and multi-species population structure inference. The same analyses are also run on previously published, high-quality data to facilitate comparison with my own samples. I use tools that benefit from parallelisation (see Resource Usage) and from high-memory nodes during classification, alignment, and genotype likelihood estimation.

The other two projects focus on ancient DNA. One re-analyses public data from 2-million-year-old Greenland sediments, focusing on the recovery and dating of ancient arthropod DNA using molecular clock models. The other compiles a dataset of Holocene sedaDNA from northern Eurasia to reconstruct arthropod community changes through time. Analyses involve aDNA-aware preprocessing (e.g., AdapterRemoval, mapDamage), low-stringency alignment, taxonomic assignment, and ecological clustering.

Compute demands include frequent batch jobs for read preprocessing and mapping against large reference databases, as well as array jobs for downstream per-sample and per-taxon analyses. Storage needs are moderate (a few TB), as I can use some resources in my PI's project (NAISS 2025/22-968), but temporary file usage and RAM during classification, mapping, and population-level inference can be intensive. I aim to use NAISS small-scale resources primarily for alignment, variant calling, and genomic diversity metrics, as well as for smaller-scale data parsing and figure generation with Python.

As part of the Data Driven Life Science initiative, this work is conducted within the Swedish Museum of Natural History and Stockholm University. It explores how complex metagenomic data, both modern and ancient, can be used to uncover population-level patterns. The findings and methods could open new possibilities in conservation, ecology, biodiversity monitoring, and evolutionary biology.