This project focuses on the development and benchmarking of computational methods for cancer phylogeny inference using single-cell, bulk, and spatial sequencing data. Our goal is to improve the reconstruction of tumor evolutionary histories, clonal architectures, and copy-number/variant trajectories by designing and validating novel statistical and algorithmic approaches.
All analyses are conducted exclusively on publicly available, non-sensitive datasets, primarily obtained from international repositories such as the Sequence Read Archive (SRA) and similar open resources. These datasets are distributed in raw sequencing formats (e.g., FASTQ/BAM/CRAM), which are inherently large. For rigorous benchmarking, we must download, store, preprocess, and repeatedly re-analyze multiple large datasets spanning different cancer types, technologies (e.g., bulk vs. single-cell), and experimental conditions.
Significant storage capacity is therefore required for:
Temporary and long-term storage of raw sequencing files,
Intermediate products such as aligned reads, variant calls, copy-number profiles, and phased haplotypes,
Outputs from large-scale simulation and benchmarking experiments,
Reproducible, versioned datasets supporting method comparison and validation.
The storage infrastructure is a critical enabling component of this project, as method development requires iterative reprocessing of the same large datasets under multiple parameter settings and competing models. Without sufficient storage, it would not be possible to maintain reproducibility, traceability, or fair benchmarking across methods.
This project does not handle any personal, clinical, or identifiable patient information beyond what is already released under public-access policies of the original data repositories.