A genotype likelihood based pipeline for analyzing population genomic data of low coverage samples

Dnr:

NAISS 2023/22-600

Type:

NAISS Small Compute

Principal Investigator:

Zachary Nolen

Affiliation:

Lunds universitet

Start Date:

2023-05-25

End Date:

2024-06-01

Primary Classification:

10203: Bioinformatics (Computational Biology) (applications to be 10610)

Webpage:

Allocation

Crex 1 at UPPMAX: 128 GiB
Rackham at UPPMAX: 10 x 1000 core-h/month

Abstract

While most population genomic analyses are performed on called genotypes, genotype calling introduces biases when applied to low-coverage data. Genotype-likelihood-based analyses can help alleviate these biases by incorporating the uncertainty in the genotype at each position into the analyses. However, because these tools were developed relatively recently, researchers can find it challenging to adopt them efficiently: documentation can be fragmented, and resources describing high-quality methods are only recently becoming available in the literature.
Here, we will develop a Snakemake-based pipeline that takes raw paired-end sequencing data and processes it through common population genomic analyses within a genotype likelihood framework. Users will be able to run it with a curated set of settings developed through extensive literature review or with custom settings suited to their own needs. Due to the modular nature of Snakemake, new analyses can be easily added as they become available, and users can perform only the analyses necessary for their dataset.
As the primary users of genotype-likelihood-based methods are those with low-coverage DNA data due to sample age or quality, the analyses incorporated will focus on what is most relevant for projects facing these challenges. Our initial focus will be on standard population genetic measures such as nucleotide diversity (pairwise pi), Watterson's theta, Tajima's D, and heterozygosity. We will incorporate measures of population structure through both PCA and admixture analyses, and allow population differentiation to be estimated through Fst. As many researchers working with low-coverage data come from the field of conservation genomics, we will also aim to include analyses of particular interest to these studies: inbreeding coefficients, runs of homozygosity, and estimation of effective migration surfaces.
Once completed, this pipeline will be published as open source code, allowing for continued, collaborative development to tailor its use to the needs of the research community.
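
As an illustration of the diversity measures named in the abstract, the following is a minimal Python sketch of how pairwise pi, Watterson's theta, and Tajima's D can be computed from a site frequency spectrum (SFS), the kind of summary that genotype likelihood methods typically estimate in place of called genotypes. It is not the pipeline's code. The function names and the input layout are assumptions made for this example: `sfs` is a list of length n + 1 whose entry i is the (possibly fractional, expected) number of sites at which i of the n sampled chromosomes carry the derived allele. The formulas are the standard estimators (Tajima 1989 for D).

```python
import math


def sfs_summary(sfs):
    """Return (pi, theta_w, tajima_d) as totals over all sites in an unfolded SFS.

    sfs[i] = (expected) number of sites where i of n chromosomes carry the
    derived allele; n = len(sfs) - 1. This layout is an assumption of the sketch.
    """
    n = len(sfs) - 1
    if n < 2:
        raise ValueError("need at least two sampled chromosomes")

    # Segregating classes are i = 1 .. n-1; bins 0 and n are fixed sites.
    seg = sfs[1:n]
    s = sum(seg)

    # Pairwise pi: each class i contributes 2*i*(n-i) / (n*(n-1)) per site.
    pi = sum(x * 2 * i * (n - i) for i, x in enumerate(seg, start=1)) / (n * (n - 1))

    # Watterson's theta: segregating sites scaled by the harmonic number a1.
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    theta_w = s / a1

    # Variance constants for Tajima's D.
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    var = e1 * s + e2 * s * (s - 1)
    # With fractional expected counts the variance term can be non-positive.
    tajima_d = (pi - theta_w) / math.sqrt(var) if var > 0 else float("nan")
    return pi, theta_w, tajima_d


def heterozygosity(sfs_single):
    """Heterozygosity of one diploid individual from its 3-bin SFS (n = 2)."""
    if len(sfs_single) != 3:
        raise ValueError("expected three bins: hom. ancestral, het., hom. derived")
    return sfs_single[1] / sum(sfs_single)
```

The values returned by `sfs_summary` are sums over sites; dividing pi and theta by `sum(sfs)` gives per-site estimates. Under the neutral expectation, where class i holds theta / i sites, pi and Watterson's theta coincide and Tajima's D is zero, which makes a convenient sanity check.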