Large-Scale Foundation Models for Proteomics and Data-Driven Molecular Biology

Dnr: NAISS 2026/3-479
Type: NAISS Medium
Principal Investigator: Lukas Käll
Affiliation: Kungliga Tekniska högskolan
Start Date: 2026-06-24
End Date: 2027-07-01
Primary Classification: 10203: Bioinformatics (Computational Biology) (Applications at 10610)
Secondary Classification: 10604: Cell Biology
Webpage: https://kaell.se

Allocation

Arrhenius Disk at NAISS: 10000 GiB
Arrhenius GPU at NAISS: 4000 GPU-h/month
Arrhenius Flash at NAISS: 1000 GiB
Arrhenius CPU at NAISS: 20 x 1000 core-h/month

Abstract

We apply for GPU and CPU resources on Arrhenius to continue and expand our research programme on foundation models for data-driven cell and molecular biology. The central project develops self-supervised and attention-based models for mass spectrometry-based proteomics, with the goal of learning general-purpose representations of peptide tandem mass spectra at an unprecedented scale. The project has already produced accepted and submitted publications, new neural architectures for spectra, a 200M-spectrum pretraining corpus, and strong preliminary evidence that large-scale unsupervised pretraining substantially improves downstream de novo peptide sequencing. This research is now compute-bound. The next stage requires synchronized large-scale multi-node GPU training, fast access to very large datasets, and the ability to run long, high-throughput experiments over hundreds of millions of spectra. These requirements cannot be met by access to smaller clusters. The Arrhenius GPU partition, with NVIDIA Grace Hopper nodes and a large high-performance parallel filesystem, is essential for completing the planned work. Without access to this system, the key scientific objectives of the project cannot realistically be achieved within the project period. The requested allocation will support two complementary projects in the Käll group: Alfred Nilsson's work on large-scale self-supervised foundation models for MS2 spectra, and Yuqi Zheng's project on proteoform inference from shotgun proteomics data.