We apply for GPU and CPU resources on Arrhenius to continue and expand our research programme on foundation models for data-driven cell and molecular biology. The central project develops self-supervised and attention-based models for mass spectrometry-based proteomics, with the goal of learning general-purpose representations of peptide tandem mass spectra at an unprecedented scale. The project has already produced accepted and submitted publications, new neural architectures for spectra, a 200M-spectrum pretraining corpus, and strong preliminary evidence that large-scale unsupervised pretraining substantially improves downstream de novo peptide sequencing.
This research is now compute-bound. The next stage requires synchronized multi-node GPU training at large scale, fast access to very large datasets, and the ability to run long, high-throughput experiments over hundreds of millions of spectra. Smaller clusters cannot meet these requirements. The Arrhenius GPU partition, with its NVIDIA Grace Hopper nodes and large high-performance parallel filesystem, is therefore essential: without access to it, the project's key scientific objectives cannot realistically be achieved within the project period.
The requested allocation will support two complementary projects in the Käll group: Alfred Nilsson's work on large-scale self-supervised foundation models for MS2 spectra, and Yuqi Zheng's project on proteoform inference from shotgun proteomics data.