Using transformers to analyze sequential data from biology and chemistry

Dnr:

NAISS 2023/22-752

Type:

NAISS Small Compute

Principal Investigator:

Erik Kristiansson

Affiliation:

Chalmers tekniska högskola

Start Date:

2023-08-01

End Date:

2024-08-01

Primary Classification:

10203: Bioinformatics (Computational Biology) (applications to be 10610)

Webpage:

Allocation

Alvis at C3SE: 1000 GPU-h/month
Mimer at C3SE: 500 GiB

Abstract

Transformers have revolutionized natural language processing and, more recently, have also shown great potential for many applications within the life sciences. In this project we will use transformers to analyze sequential data from two sources: a) DNA sequence data from genes that make bacteria resistant to antibiotics and b) molecules that are toxic to the environment. In the first project, we use transformers to describe dependencies within protein sequences in order to discriminate between genes that provide resistance to antibiotics and those that do not. We have already implemented a BERT-like transformer model that is first pre-trained on a large volume of unlabeled data and then fine-tuned on a smaller dataset of experimentally validated data. Preliminary results show superior performance compared to traditionally used algorithms for sequence comparison (e.g. hidden Markov models). In the second project, we use transformers to describe the structure of molecules from a wide range of chemicals. Here we use a pre-trained transformer (ChemBERTa), which is combined with additional data in a deep neural network and then used to predict chemical toxicity. Our preliminary results show that our transformer model can be applied to a much larger range of chemicals than existing methods (QSAR models). In both of these projects we need computational resources to scale up our models, and also the data, so that we can perform a proper evaluation. We have previously worked with the models on local computers and/or cloud services, but we no longer see this strategy as sustainable. The project is funded from multiple sources, including VR, CHAIR, CARe (https://www.gu.se/en/care) and FRAM (https://www.gu.se/en/fram-chemical-risk-assessment).

Classification: Bioinformatics, Microbiology, Infectious Medicine, Pharmacology and Toxicology, Other Chemistry, Environmental Sciences