Large-Scale Machine Learning for Single-Cell Data, Cancer and Gene Regulatory Network Inference

Dnr:

NAISS 2025/22-1610

Type:

NAISS Small Compute

Principal Investigator:

Jens Lagergren

Affiliation:

Kungliga Tekniska högskolan

Start Date:

2025-12-01

End Date:

2026-12-01

Primary Classification:

10610: Bioinformatics and Computational Biology (Methods development to be 10203)

Webpage:

https://lagergrenlab.org/

Allocation

Alvis at C3SE: 500 GPU-h/month
Mimer at C3SE: 500 GiB
Tetralith at NSC: 10 x 1000 core-h/month

Abstract

We are a group of researchers affiliated with WASP (Wallenberg AI, Autonomous Systems and Software Program), SciLifeLab, and KTH Royal Institute of Technology, working at the intersection of deep learning methodology, probabilistic modeling, and data-driven life science. We collaborate in highly interdisciplinary teams that combine an understanding of experimental technologies and pertinent medical questions with competence in developing probabilistic models, designing machine learning algorithms, and implementing them on a wide range of HPC hardware and frameworks. Beyond devising computational methods, we apply them to data generated in collaborative projects and work closely with other cutting-edge single-cell groups.

One central line of work develops and evaluates a novel framework for general-purpose, large-scale causal discovery. Discovering causal relationships in high-dimensional, non-linear systems remains a core challenge in scientific inference. We introduce Bayesian Scalable Differentiable Causal Discovery (BSDCD), a method designed to recover large causal graphs with quantified uncertainty. BSDCD builds on a stable differentiable optimization framework that enforces acyclicity through a spectral constraint and accommodates sparse, large-scale structures. To capture uncertainty in the inferred causal graph, we incorporate a Laplace-based Bayesian approximation centered on the Hessian of the spectral radius constraint, yielding posterior edge inclusion probabilities with minimal computational overhead. On synthetic datasets involving hundreds of nodes, BSDCD significantly outperforms conventional maximum likelihood approaches, delivering superior accuracy in structure recovery and more reliable average treatment effect (ATE) estimates.
Applying BSDCD to large-scale CRISPR perturbation data covering hundreds of genes, we identify gene regulatory interactions that are reproducible across independent datasets. This demonstrates that BSDCD can scale causal discovery to complex biological systems while providing principled measures of confidence, bridging differentiable modeling and Bayesian inference in causal structure learning.

In parallel, we run large-scale projects on phylogenetics and single-cell analysis. One project focuses on Variational Bayesian Phylogenetic Inference (VBPI) and subsplit Bayesian network (SBN) mixtures, and another on inference of tumor subclonal copy number events from single-cell data. We use our Tetralith allocation to store synthetic, model-generated data and real benchmark datasets. Tetralith’s many CPU cores let us run a large number of experiments in parallel, which has enabled state-of-the-art results in phylogenetic inference. This capacity for parallelization is key to these efforts and matches our broader need for scalable HPC resources to support causal discovery, phylogenetics, and single-cell tumor analysis within our interdisciplinary research program.
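
To illustrate the idea behind a spectral acyclicity constraint mentioned above, here is a minimal sketch in Python with NumPy. It is not the BSDCD implementation: the function name `spectral_radius`, the use of elementwise squared weights, and the toy matrices are assumptions made for illustration. The sketch relies on a standard fact: a nonnegative matrix has spectral radius zero exactly when its directed graph contains no cycle, so driving the spectral radius toward zero pushes a learned graph toward a DAG.

```python
import numpy as np


def spectral_radius(W):
    """Spectral radius of the nonnegative matrix W * W (elementwise square).

    Squaring makes every entry nonnegative, so the result is 0 exactly
    when the graph with an edge i -> j wherever W[i, j] != 0 is acyclic.
    """
    A = W * W
    return float(np.max(np.abs(np.linalg.eigvals(A))))


# Hypothetical 3-node DAG: edges 0 -> 1, 0 -> 2, 1 -> 2.
W_dag = np.array([
    [0.0, 0.8, -1.2],
    [0.0, 0.0, 0.5],
    [0.0, 0.0, 0.0],
])

# Hypothetical 3-node cycle: edges 0 -> 1, 1 -> 2, 2 -> 0.
W_cyc = np.array([
    [0.0, 0.5, 0.0],
    [0.0, 0.0, -2.0],
    [1.0, 0.0, 0.0],
])

print("DAG penalty:  ", spectral_radius(W_dag))
print("cycle penalty:", spectral_radius(W_cyc))
```

The acyclic example yields a penalty of essentially zero, while the cyclic one yields a strictly positive value. A dense eigendecomposition is used here only for clarity; at the scale of hundreds of nodes a practical method would presumably rely on a cheaper, differentiable estimate of the spectral radius.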