Uncovering Proteoforms from Shotgun Proteomics Data

Dnr:

NAISS 2025/23-593

Type:

NAISS Small Storage

Principal Investigator:

Yuqi Zheng

Affiliation:

Kungliga Tekniska högskolan

Start Date:

2025-11-03

End Date:

2026-12-01

Primary Classification:

10203: Bioinformatics (Computational Biology) (Applications at 10610)

Allocation

Centre Storage at NSC: 4000 GiB

Abstract

The central dogma of molecular biology describes the flow of genetic information from DNA to proteins: DNA is transcribed into RNA, and RNA is translated into protein. In practice, this process is far more intricate, and each gene can give rise to a diverse range of protein products. Each distinct molecular form of a protein originating from a single gene is referred to as a proteoform; proteoforms differ through genetic variation, alternative RNA splicing, and post-translational modifications.

Our goal is to investigate the proteoform landscape from multiple perspectives, drawing on two resources: a deep proteomics dataset and extensive long-read mRNA-seq data that capture transcript abundance and variation in the same cell line. We are investigating whether transcript variants detected by long-read sequencing that differ from the canonical UniProt transcript definitions can also be detected in the deep proteomics data. This work includes integrative annotation and analysis across molecular levels, spanning genes, transcripts, exons, peptides, and proteoforms.

A further challenge lies in inferring proteoform composition and abundance from peptide-level data. Because a peptide can originate from multiple proteoforms (the Humpty-Dumpty problem), the measured signal of an individual peptide reflects the combined contribution of all proteoforms in which it occurs. Most existing software tools only aggregate peptides into protein groups. We have peptide abundance data from size-exclusion chromatography coupled to mass spectrometry (SEC-MS), in which different proteoforms are expected to be present in different fractions. We aim to develop a probabilistic model to infer proteoform compositions and their abundances from these peptide-level measurements.
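To make the inference problem concrete, the following is a minimal sketch, not the probabilistic model the project aims to develop. It assumes that each peptide's abundance in a fraction is simply the sum of the abundances of the proteoforms containing it, with no noise and equal peptide response, and it recovers proteoform profiles by non-negative least squares using multiplicative updates. The function name, the peptide-to-proteoform membership matrix, and all numbers are hypothetical and chosen only for illustration.

```python
# Sketch under stated assumptions: peptide signal = sum of the abundances of
# the proteoforms that contain the peptide (noise-free, equal response).
# All names and values here are hypothetical, not taken from the project.

def infer_proteoform_profiles(membership, peptide_abundance, n_iter=500, eps=1e-12):
    """Estimate non-negative proteoform abundances per fraction.

    membership[p][k] is 1 if peptide p occurs in proteoform k, else 0.
    peptide_abundance[p][f] is the measured abundance of peptide p in fraction f.
    Returns x[k][f], the estimated abundance of proteoform k in fraction f.
    """
    n_pep = len(membership)
    n_pf = len(membership[0])
    n_frac = len(peptide_abundance[0])
    # Start from a strictly positive guess so the updates stay non-negative.
    x = [[1.0] * n_frac for _ in range(n_pf)]
    for _ in range(n_iter):
        # Peptide signal predicted by the current proteoform estimates.
        pred = [[sum(membership[p][k] * x[k][f] for k in range(n_pf))
                 for f in range(n_frac)] for p in range(n_pep)]
        # Multiplicative update: scale each estimate by observed / predicted
        # signal, summed over the peptides belonging to that proteoform.
        for k in range(n_pf):
            for f in range(n_frac):
                num = sum(membership[p][k] * peptide_abundance[p][f] for p in range(n_pep))
                den = sum(membership[p][k] * pred[p][f] for p in range(n_pep))
                x[k][f] *= num / (den + eps)
    return x

# Toy example: two proteoforms of one gene, four peptides, three SEC fractions.
# Peptides 0 and 2 are shared; peptide 1 is unique to proteoform 0,
# peptide 3 is unique to proteoform 1.
membership = [[1, 1], [1, 0], [1, 1], [0, 1]]
true_profiles = [[10.0, 4.0, 1.0], [1.0, 3.0, 8.0]]

# Simulate the peptide-level measurements implied by the true profiles.
observed = [[sum(membership[p][k] * true_profiles[k][f] for k in range(2))
             for f in range(3)] for p in range(4)]

estimated = infer_proteoform_profiles(membership, observed)
for k, profile in enumerate(estimated):
    print(k, [round(v, 3) for v in profile])
```

In this toy setting the profiles are recoverable because each proteoform has at least one peptide the other lacks. When proteoforms share all of their observed peptides, the decomposition is not unique, which is one reason a probabilistic treatment with explicit uncertainty may be preferable to a point estimate like this one.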