Polyadenylation site (PA) choice defines transcript 3′ ends and constitutes a major regulatory layer that is cell-type- and state-specific. However, most existing predictive models of alternative polyadenylation (APA) are primarily sequence-driven (cis-regulatory grammar), which limits their ability to explain tissue- and cell-type-specific APA events.
We propose PolyA-Foundation, a foundation model that learns transferable representations to predict PA usage from ATAC-seq chromatin accessibility peaks together with the regulatory sequence overlapping those peaks. The primary goal is to build a scalable deep learning framework that generalizes across human cell types and studies, enabling ATAC-only inference of PA programs. We will adopt a two-stage strategy: (i) large-scale self-supervised pretraining on public scATAC-seq atlases to learn regulatory representations from accessibility patterns and sequence features; and (ii) supervised fine-tuning on paired multi-omics datasets linking ATAC profiles to RNA-derived PA quantification.
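The two-stage strategy can be sketched in miniature as follows. This is an illustrative toy only: the real model will be a deep network trained at scale, whereas here stage (i) is a masked-reconstruction objective on a synthetic cells-by-peaks accessibility matrix with linear encoder/decoder maps, and stage (ii) fine-tunes a small sigmoid head to regress PA usage fractions; all data, dimensions, and parameters are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_peaks, n_latent = 64, 200, 16

X = rng.random((n_cells, n_peaks))             # synthetic cells x peaks accessibility
W_enc = rng.normal(0, 0.1, (n_peaks, n_latent))
W_dec = rng.normal(0, 0.1, (n_latent, n_peaks))

def pretrain_step(X, W_enc, W_dec, mask_rate=0.15, lr=0.1):
    """Stage (i): one masked-reconstruction gradient step.

    Randomly masked peak values are reconstructed from the unmasked
    ones; the loss is computed on masked entries only.
    """
    mask = rng.random(X.shape) < mask_rate
    X_in = np.where(mask, 0.0, X)              # zero out masked peaks
    Z = X_in @ W_enc                           # latent regulatory representation
    err = (Z @ W_dec - X) * mask
    g_dec = Z.T @ err / mask.sum()
    g_enc = X_in.T @ (err @ W_dec.T) / mask.sum()
    return W_enc - lr * g_enc, W_dec - lr * g_dec

for _ in range(50):
    W_enc, W_dec = pretrain_step(X, W_enc, W_dec)

# Stage (ii): fine-tune a PA-usage head on paired labels (synthetic here).
y = rng.random(n_cells)                        # proximal PA usage fractions
W_head = rng.normal(0, 0.1, n_latent)

def predict_pa(X, W_enc, W_head):
    """Sigmoid head keeps predicted usage fractions in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-(X @ W_enc) @ W_head))

for _ in range(100):
    p = predict_pa(X, W_enc, W_head)
    grad = (X @ W_enc).T @ ((p - y) * p * (1 - p)) / n_cells
    W_head -= 1.0 * grad

pred = predict_pa(X, W_enc, W_head)            # per-cell predicted PA usage
```

In practice the encoder would consume both accessibility and peak-overlapping sequence, but the pretrain-then-fine-tune flow is the same.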
Deliverables are: (1) a curated non-sensitive training corpus connecting ATAC-derived features to gene- and isoform-level PA usage across diverse human cell types; (2) a GPU-efficient pretraining and fine-tuning pipeline with pretrained checkpoints for ATAC-to-PA prediction; and (3) rigorous benchmarking and ablation studies (e.g., cross-dataset and held-out cell-type transfer) plus interpretation tooling. Once the model is established, we will use it as an analysis engine to quantify promoter–PA coupling and to prioritize enhancer/variant effects on APA by scoring predicted PA shifts, providing a framework for interpreting putative APA-QTL mechanisms from regulatory DNA. We will pretrain on the large-scale public scATAC-seq atlas of Zhang et al. (https://doi.org/10.1016/j.cell.2021.10.024) and fine-tune and evaluate on Tabula Sapiens data (https://www.science.org/doi/10.1126/science.abl4896).
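The variant-scoring idea can be illustrated with a minimal in-silico mutagenesis sketch: predict PA usage for the reference and alternate alleles and report the delta. The `predicted_pa_usage` function below is a hypothetical stand-in (a fixed random linear map plus sigmoid), not the trained model; only the ref-vs-alt delta logic reflects the proposed analysis.

```python
import numpy as np

rng = np.random.default_rng(1)

def one_hot(seq):
    """One-hot encode a DNA sequence (A, C, G, T) as a 4 x L array."""
    idx = {"A": 0, "C": 1, "G": 2, "T": 3}
    oh = np.zeros((4, len(seq)))
    for j, base in enumerate(seq):
        oh[idx[base], j] = 1.0
    return oh

# Hypothetical stand-in for a fine-tuned model: a fixed linear map from
# the flattened one-hot sequence to a PA usage fraction via a sigmoid.
L = 20
W = rng.normal(0, 0.5, 4 * L)

def predicted_pa_usage(seq):
    logit = one_hot(seq).ravel() @ W
    return 1.0 / (1.0 + np.exp(-logit))

def pa_shift(ref_seq, alt_seq):
    """Predicted PA-usage shift of the alternate vs. reference allele."""
    return predicted_pa_usage(alt_seq) - predicted_pa_usage(ref_seq)

ref = "ACGTACGTACGTACGTACGT"
alt = ref[:10] + "A" + ref[11:]   # single-nucleotide substitution at position 10
delta = pa_shift(ref, alt)
```

Ranking variants by `|delta|` across candidate loci is one straightforward way the model could prioritize putative APA-QTLs.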
The project is compute-intensive due to large-scale pretraining and extensive benchmarking, and therefore targets Alvis for reproducible deep learning training and evaluation. All code, models, and evaluation scripts will be released openly.