Polyadenylation site (PA) choice defines transcript 3′ ends and constitutes a major regulatory layer that is cell-type- and state-specific. However, most existing predictive models of alternative polyadenylation (APA) are primarily sequence-driven (cis-regulatory grammar), which limits their ability to explain tissue- and cell-type-specific APA events.
We propose PolyA-Foundation, a foundation model that learns transferable representations to predict PA usage from ATAC-seq chromatin accessibility peaks together with the regulatory sequence overlapping those peaks. The primary goal is to build a scalable deep learning framework that generalizes across human cell types and studies, enabling ATAC-only inference of PA programs. We will adopt a two-stage strategy: (i) large-scale self-supervised pretraining on public scATAC-seq atlases to learn regulatory representations from accessibility patterns and sequence features; and (ii) supervised fine-tuning on paired multi-omics datasets linking ATAC profiles to RNA-derived PA quantification.
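The two-stage strategy can be sketched in miniature as follows. This is an illustrative toy only: the real model will be a deep network trained at scale, whereas here stage (i) is a masked-reconstruction objective on a synthetic cells-by-peaks accessibility matrix with linear encoder/decoder maps, and stage (ii) fine-tunes a small sigmoid head to regress PA usage fractions; all data, dimensions, and parameters are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_peaks, n_latent = 64, 200, 16

X = rng.random((n_cells, n_peaks))             # synthetic cells x peaks accessibility
W_enc = rng.normal(0, 0.1, (n_peaks, n_latent))
W_dec = rng.normal(0, 0.1, (n_latent, n_peaks))

def pretrain_step(X, W_enc, W_dec, mask_rate=0.15, lr=0.1):
    """Stage (i): one masked-reconstruction gradient step.

    Randomly masked peak values are reconstructed from the unmasked
    ones; the loss is computed on masked entries only.
    """
    mask = rng.random(X.shape) < mask_rate
    X_in = np.where(mask, 0.0, X)              # zero out masked peaks
    Z = X_in @ W_enc                           # latent regulatory representation
    err = (Z @ W_dec - X) * mask
    g_dec = Z.T @ err / mask.sum()
    g_enc = X_in.T @ (err @ W_dec.T) / mask.sum()
    return W_enc - lr * g_enc, W_dec - lr * g_dec

for _ in range(50):
    W_enc, W_dec = pretrain_step(X, W_enc, W_dec)

# Stage (ii): fine-tune a PA-usage head on paired labels (synthetic here).
y = rng.random(n_cells)                        # proximal PA usage fractions
W_head = rng.normal(0, 0.1, n_latent)

def predict_pa(X, W_enc, W_head):
    """Sigmoid head keeps predicted usage fractions in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-(X @ W_enc) @ W_head))

for _ in range(100):
    p = predict_pa(X, W_enc, W_head)
    grad = (X @ W_enc).T @ ((p - y) * p * (1 - p)) / n_cells
    W_head -= 1.0 * grad

pred = predict_pa(X, W_enc, W_head)            # per-cell predicted PA usage
```

In practice the encoder would consume both accessibility and peak-overlapping sequence, but the pretrain-then-fine-tune flow is the same.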
Deliverables are: (1) a curated non-sensitive training corpus connecting ATAC-derived features to gene- and isoform-level PA usage across diverse human cell types; (2) a GPU-efficient pretraining and fine-tuning pipeline with pretrained checkpoints for ATAC-to-PA prediction; and (3) rigorous benchmarking and ablation studies (e.g., cross-dataset and held-out cell-type transfer) plus interpretation tooling. Once the model is established, we will use it as an analysis engine to quantify promoter–PA coupling and to prioritize enhancer/variant effects on APA by scoring predicted PA shifts, providing a framework for interpreting putative APA-QTL mechanisms from regulatory DNA. We will pretrain on the large-scale public scATAC-seq atlas of Zhang et al. (https://doi.org/10.1016/j.cell.2021.10.024) and fine-tune and evaluate on Tabula Sapiens data (https://www.science.org/doi/10.1126/science.abl4896).
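The variant-scoring idea can be illustrated with a minimal in-silico mutagenesis sketch: predict PA usage for the reference and alternate alleles and report the delta. The `predicted_pa_usage` function below is a hypothetical stand-in (a fixed random linear map plus sigmoid), not the trained model; only the ref-vs-alt delta logic reflects the proposed analysis.

```python
import numpy as np

rng = np.random.default_rng(1)

def one_hot(seq):
    """One-hot encode a DNA sequence (A, C, G, T) as a 4 x L array."""
    idx = {"A": 0, "C": 1, "G": 2, "T": 3}
    oh = np.zeros((4, len(seq)))
    for j, base in enumerate(seq):
        oh[idx[base], j] = 1.0
    return oh

# Hypothetical stand-in for a fine-tuned model: a fixed linear map from
# the flattened one-hot sequence to a PA usage fraction via a sigmoid.
L = 20
W = rng.normal(0, 0.5, 4 * L)

def predicted_pa_usage(seq):
    logit = one_hot(seq).ravel() @ W
    return 1.0 / (1.0 + np.exp(-logit))

def pa_shift(ref_seq, alt_seq):
    """Predicted PA-usage shift of the alternate vs. reference allele."""
    return predicted_pa_usage(alt_seq) - predicted_pa_usage(ref_seq)

ref = "ACGTACGTACGTACGTACGT"
alt = ref[:10] + "A" + ref[11:]   # single-nucleotide substitution at position 10
delta = pa_shift(ref, alt)
```

Ranking variants by `|delta|` across candidate loci is one straightforward way the model could prioritize putative APA-QTLs.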
The project is compute-intensive due to large-scale pretraining and extensive benchmarking, and therefore targets Alvis for reproducible deep learning training and evaluation. All code, models, and evaluation scripts will be released openly.