Storage project for single cell and bulk tumor phylogenetic reconstruction
Dnr: NAISS 2025/6-472

Type: NAISS Medium Storage

Principal Investigator: Jens Lagergren

Affiliation: Kungliga Tekniska högskolan

Start Date: 2026-01-01

End Date: 2026-12-01

Primary Classification: 10201: Computer Sciences

Secondary Classification: 10203: Bioinformatics (Computational Biology) (Applications at 10610)

Tertiary Classification: 30203: Cancer and Oncology


Abstract

This project develops and benchmarks computational methods for cancer phylogeny inference from single-cell, bulk, and spatial sequencing data. Our goal is to improve the reconstruction of tumor evolutionary histories, clonal architectures, and copy-number/variant trajectories by designing and validating novel statistical and algorithmic approaches.

All analyses are conducted exclusively on publicly available, non-sensitive datasets, primarily obtained from international repositories such as the Sequence Read Archive (SRA) and similar open resources. These datasets are distributed in raw sequencing formats (e.g., FASTQ/BAM/CRAM), which are inherently large. For rigorous benchmarking, we must download, store, preprocess, and repeatedly re-analyze multiple large datasets spanning different cancer types, technologies (bulk vs. single-cell), and experimental conditions.

Significant storage capacity is therefore required for: temporary and long-term storage of raw sequencing files; intermediate products such as aligned reads, variant calls, copy-number profiles, and phased haplotypes; outputs from large-scale simulation and benchmarking experiments; and reproducible, versioned datasets supporting method comparison and validation.

Storage is a critical enabling component of this project, because method development requires iterative reprocessing of the same large datasets under multiple parameter settings and competing models. Without sufficient storage, it would not be possible to maintain reproducibility, traceability, or fair benchmarking across methods.

This project does not handle any personal, clinical, or identifiable patient information beyond what is already released under the public-access policies of the original data repositories.