Multimodal representations of proteins (SubCellVS)

Dnr:

NAISS 2026/3-387

Type:

NAISS Medium

Principal Investigator:

Mathias Uhlen

Affiliation:

Kungliga Tekniska högskolan

Start Date:

2026-06-24

End Date:

2027-07-01

Primary Classification:

10610: Bioinformatics and Computational Biology (Methods development to be 10203)

Secondary Classification:

10203: Bioinformatics (Computational Biology) (Applications at 10610)

Webpage:

Allocation

Arrhenius Disk at NAISS: 30000 GiB
Arrhenius Flash at NAISS: 5000 GiB
Arrhenius GPU at NAISS: 2000 GPU-h/month

Abstract

Multimodal representations of proteins have led to significant advances in scientific discovery. Current models such as ESM3 [1] and ProtT5 [2] focus primarily on protein sequence, structure, and function, but perform poorly at predicting subcellular localization [3]. Subcellular protein localization is vital for understanding the function of different cellular systems and is essential for disease characterization and drug discovery. Disease-causing mutations often lead to mislocalizations that can only be captured with microscopy. Hence, there is a need for a more comprehensive multimodal representation of proteins that includes their localization in the cell. In previous work [4; currently under review at Nature], we built a vision-only protein representation model and showed that it robustly learned protein localization patterns and outperformed state-of-the-art models across various datasets for cell-cycle and drug perturbation prediction. In this work, we aim to enhance our understanding of proteins by building a multimodal protein representation model and an independent cell representation model, both agnostic to input channel combinations. We have collected over 2.2 million single-cell images of the localization patterns of proteins from over 15,000 genes in more than 50 cell lines, drawn from 5 different datasets. We have also collected the sequences of those proteins. Altogether, the model will be trained on the largest multimodal dataset for protein localization and representation. The resulting model is intended to capture both the image and sequence modalities of proteins in a single representation. It will help characterize proteins better by placing them in their cellular context, and so advance our understanding of diseases. The model will be extensively evaluated across a diverse set of vision and sequence tasks, and the benchmark will be released publicly to encourage further development in the field.

[1] Hayes, Thomas, et al. "Simulating 500 million years of evolution with a language model." Science 387.6736 (2025): 850-858.
[2] Elnaggar, Ahmed, et al. "ProtTrans: towards cracking the language of life’s code through self-supervised learning." IEEE Transactions on Pattern Analysis and Machine Intelligence 44 (2021): 7112-7127.
[3] Wefers, Zoe, et al. "A comprehensive benchmark of sequence-based subcellular localization predictors for human proteins." Submitted to Nature 2025.
[4] Gupta, Ankit, et al. "SubCell: Proteome-aware vision foundation models for microscopy capture single-cell biology." bioRxiv (2025): 2024-12.