Deep Learning prediction of RNA tetraloops

Dnr:

NAISS 2024/5-16

Type:

NAISS Medium Compute

Principal Investigator:

Samuel Flores

Affiliation:

Stockholms universitet

Start Date:

2024-01-29

End Date:

2025-02-01

Primary Classification:

10203: Bioinformatics (Computational Biology) (applications to be 10610)

Secondary Classification:

10601: Structural Biology

Tertiary Classification:

10603: Biophysics

Webpage:

https://www.su.se/english/research/research-groups/samuel-flores-research-group

Allocation

Alvis at C3SE: 5000 GPU-h/month

Abstract

Deep learning is revolutionizing many fields of science, with one high point being AlphaFold2’s dominance of the CASP14 competition. Attention has since turned to RNA structure prediction. Predicting RNA structure is possible, but it currently requires high-quality multiple sequence alignments, which are not available for most RNAs. We therefore turn to the problem of predicting RNA structure from a single sequence. One step towards that goal is predicting motifs such as the GNRA and certain other tetraloops, which follow a sequence signature and fold into a specific 3D configuration. In this work we show that we can predict tetraloops with high accuracy, specificity, and sensitivity, based on a single 8-residue-long RNA sequence. Intrigued by the method’s prescience, we statistically analyze the GNRA tetraloops in the dataset and find previously unpublished correlations, even suggesting that the motif should instead be called “GRNA.” The method is trained on a recently published dataset of tetraloops observed in experimental 3D structures; we supplemented this with non-tetraloop sequences from the same structures, giving a total of over 2.25 million sequences. The network itself is relatively simple, comprising 17 densely connected layers. We are now testing a language model, DNABERT2, which requires significantly greater resources to train, hence the need for supercomputer time.
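To make the input representation concrete, the following is a minimal Python sketch of how 8-residue windows might be cut from an RNA sequence, one-hot encoded, and checked against the GNRA sequence signature using the standard IUPAC codes (N = any base, R = A or G). It is not the project's code. The function names, the one-hot encoding, and the placement of the loop in the central four positions of the window are all assumptions made for illustration; the abstract states only that the input is a single 8-residue sequence.

```python
# Illustrative sketch only; not the project's code.
ALPHABET = "ACGU"
WINDOW = 8  # sequence length stated in the abstract

# IUPAC nucleotide codes needed for the GNRA signature.
IUPAC = {"G": "G", "A": "A", "N": "ACGU", "R": "AG"}


def windows(seq, size=WINDOW):
    """Return every contiguous window of `size` residues in `seq`."""
    seq = seq.upper().replace("T", "U")
    return [seq[i:i + size] for i in range(len(seq) - size + 1)]


def one_hot(window):
    """Flatten a window into a one-hot vector with 4 entries per residue."""
    vec = []
    for base in window:
        vec.extend(1.0 if base == ref else 0.0 for ref in ALPHABET)
    return vec


def matches_signature(window, signature="GNRA", offset=2):
    """Check whether the residues starting at `offset` fit an IUPAC signature.

    Assumption: the loop occupies the central four positions of the window.
    """
    core = window[offset:offset + len(signature)]
    return all(base in IUPAC[code] for base, code in zip(core, signature))


if __name__ == "__main__":
    # Hypothetical example sequence containing a GAAA loop.
    for w in windows("GGCGAAAGCC"):
        print(w, len(one_hot(w)), matches_signature(w))
```

A signature match of this kind is only a sequence-level filter. The abstract's point is that a trained network decides whether a window actually folds into a tetraloop, which a pattern check alone cannot do.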