Imputation on pooled genotype data

Dnr:

NAISS 2023/22-735

Type:

NAISS Small Compute

Principal Investigator:

Camille Clouard

Affiliation:

Uppsala universitet

Start Date:

2023-08-01

End Date:

2024-08-01

Primary Classification:

10203: Bioinformatics (Computational Biology) (applications to be 10610)

Webpage:

Allocation

Klemming at PDC: 500 GiB
Crex 1 at UPPMAX: 128 GiB
Snowy at UPPMAX: 5 x 1000 core-h/month
Rackham at UPPMAX: 5 x 1000 core-h/month
Dardel at PDC: 5 x 1000 core-h/month

Abstract

Research project of the PhD studies: Computational methodologies and strategies for genotype imputation using pooled samples. In genetics, imputation consists of recovering missing data (e.g. missing genotypes at given genetic markers, often SNPs). Genotyping is the procedure used to obtain genotype data sets for multiple individuals (humans, animals, plants, …), often several hundred, at large numbers of markers, commonly millions. Over the past decade, progress in biotechnology has made it possible to genotype individuals at decreasing cost. However, given the exponential growth in data volumes and availability, and the emergence of so-called big-data problems, cutting the cost of the analyses performed remains a challenge in science and in genetics. This applies to genotyping, for instance. Drawing on information theory and group testing, the research project aims to apply pooling designs (encoding and decoding) to reduce the number of samples to genotype. This is done by gathering individuals into groups, and these groups are the samples that are genotyped. That reduction is not free, though: pooling individuals results in a loss of information about the distribution of genotypes and alleles within the pool. In the end, the uncertainty introduced by pooling appears as missing genotypes in the data set. The idea is then to reconstruct the missing data using imputation algorithms (primarily Beagle) and the available knowledge of the pooling designs implemented. For the Beagle 4.0 algorithm, both the phasing and the imputation steps are computationally demanding: they scale approximately linearly with the number of markers combined from the reference panel and the study population, and quadratically with the number of samples in the reference panel. For these reasons, a robust and powerful computational infrastructure is required.
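
To make the encoding and decoding idea concrete, here is a minimal sketch for a single marker. It rests on several assumptions that the abstract does not state: individuals are arranged in a hypothetical 4 x 4 block, each row and each column forms one pool (8 pools for 16 individuals), genotypes are coded as alternate-allele counts 0, 1 or 2, and a pool reports only whether it contains the reference allele, the alternate allele, or both. The names `pool_call`, `encode`, `decode` and `MISSING` are illustrative only, and the decoder is deliberately simple; it is not the project's actual decoding procedure.

```python
from typing import List, Tuple

MISSING = -1  # hypothetical marker for a genotype that pooling could not resolve


def pool_call(members: List[int]) -> int:
    """Genotype observed for a pool: 0 or 2 if every member is homozygous
    for the same allele, otherwise 1 (both alleles detected in the pool)."""
    if all(g == 0 for g in members):
        return 0
    if all(g == 2 for g in members):
        return 2
    return 1


def encode(block: List[List[int]]) -> Tuple[List[int], List[int]]:
    """Encode a square block of individual genotypes as row pools and column pools."""
    rows = [pool_call(row) for row in block]
    cols = [pool_call(list(col)) for col in zip(*block)]
    return rows, cols


def decode(rows: List[int], cols: List[int]) -> List[List[int]]:
    """Recover individual genotypes where a pool result is unambiguous.
    An individual in a pool called 0 (or 2) must itself be 0 (or 2);
    anything else is left as MISSING for a later imputation step."""
    decoded = []
    for r in rows:
        line = []
        for c in cols:
            if r == 0 or c == 0:
                line.append(0)
            elif r == 2 or c == 2:
                line.append(2)
            else:
                line.append(MISSING)
        decoded.append(line)
    return decoded


# Assumed example: one heterozygous carrier among 16 individuals.
block = [
    [0, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
rows, cols = encode(block)
decoded = decode(rows, cols)
n_missing = sum(g == MISSING for line in decoded for g in line)
```

In this assumed example the 15 non-carriers are decoded exactly, while the carrier is left as missing because its row pool and column pool both show the two alleles. Such entries are the missing genotypes that an imputation tool would then be asked to fill in.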