SUPR
Generative models of DNA sequences
Dnr: NAISS 2024/6-186
Type: NAISS Medium Storage
Principal Investigator: Aleksej Zelezniak
Affiliation: Chalmers tekniska högskola
Start Date: 2024-05-30
End Date: 2025-06-01
Primary Classification: 10203: Bioinformatics (Computational Biology) (applications to be 10610)

Abstract

Current synthetic genomes are essentially clones or minimally adjusted redesigns of native sequences, which teaches us little about new biology. Implanting genomes with novel properties offers unique promise for addressing questions that are not easily approachable with conventional gene-at-a-time methods. These include questions about evolution and about how genomes are fundamentally wired logically, metabolically, and genetically. Mastering these principles predictively will eventually allow us to intelligently design new life from scratch.

While protein language models and DNA models have succeeded in multiple short-sequence design tasks (protein engineering, structure prediction, protein-protein and protein-drug interactions, regulatory sequence design), genome models currently do not exist. There are two main limitations: i) it is computationally challenging to model sequences that are millions of nucleotides long [1], and ii) most importantly, experimental validation of genome-scale models is challenging.

Our lab focuses on synthetic genome design that addresses the limitations of experimental validation of genes, pathways, and genome-scale sequence models. We are developing generative genome-scale models that will enable the intelligent design of synthetic genomes by capturing multiple gene-gene-to-phenotype interactions, e.g., the presence or absence of different genes, the compatibility of regulatory parts, and gene positions on the chromosomes as they relate to phenotypes. This resource is needed to mine vast genomics datasets and to do the data engineering required to prepare large DNA sequence datasets for training AI models.

Resource Usage: We run workflow management software such as Snakemake and Nextflow to mine genomics data.