Using transformers to analyze sequential data from biology and chemistry - expanded
Dnr:

NAISS 2025/5-534

Type:

NAISS Medium Compute

Principal Investigator:

Erik Kristiansson

Affiliation:

Chalmers tekniska högskola

Start Date:

2025-10-01

End Date:

2026-10-01

Primary Classification:

10203: Bioinformatics (Computational Biology) (Applications at 10610)

Abstract

Artificial intelligence provides new, disruptive means to interpret and analyze complex biological and medical data. In particular, transformers enable analysis of sequential biological data, such as DNA sequences and medical health records, which have previously been hard to incorporate efficiently into deep learning models. Another example is graph neural networks (GNNs), which provide a general framework to describe the often complex dependencies encountered in and between organisms. In this project, we use state-of-the-art AI methodologies to interpret complex biological and medical data. The project combines the development and fine-tuning of dedicated AI methods with the analysis of large volumes of biological data generated by international collaborative experimental groups. The project contains three major themes: a) analysis of the development, spread, and diagnostics of antibiotic-resistant bacteria, b) assessment of the toxicity of chemicals to humans and the environment based on their molecular structure, and c) DNA-based assessment of microbial biodiversity. In the first theme, we use graph attention networks (GATs), transformers, and more traditional machine learning to investigate genes that make bacteria resistant to antibiotics. This includes the analysis of DNA sequences, assessment of their spread between bacterial hosts, and development of decision support to guide antibiotic treatment. In the second theme, we use fusion transformers to combine chemical structures and genetic differences between organisms with other metadata for accurate prediction of toxicity. Here, our ultimate aim is to replace animal testing with in silico predictions. In the third theme, we use transformers to interpret DNA sequences in order to classify them based on the taxonomy of their hosts. These models will improve the estimation of biodiversity and constitute an important building block for digital twins used to assess the effects of climate change.
In total, around 10 researchers work on this project. Our need for computational resources varies significantly between months, depending on data availability. Funding comes from multiple sources, including VR, FORMAS, Wallenberg, Naturvårdsverket, and the EU (JPIAMR).