Development of statistical methods for analyses of NGS data
Dnr: NAISS 2024/6-196
Type: NAISS Medium Storage
Principal Investigator: Yudi Pawitan
Affiliation: Karolinska Institutet
Start Date: 2024-09-30
End Date: 2025-04-01
Primary Classification: 10610: Bioinformatics and Systems Biology (methods development to be 10203)

Abstract

The overall goals of our project are to develop statistical and bioinformatics methodologies for the analysis of data from high-throughput omics technologies, and to apply these methods to prediction problems in common diseases such as cancer. We will develop model-based and computationally intensive methods for integrated processing, analysis and interpretation of multiple omics data. In this work, we consider the disease prediction problem in a very broad sense, ranging from raw data pre-processing and identification of relevant biomarkers to prediction of survival or response to therapy. Integrating complementary information from multiple levels of omics data, including the genome, transcriptome, proteome and interactome (such as gene interaction networks), can greatly facilitate the discovery of the true causes or drivers of a disease and of its response to therapy.

A large part of our method development concerns the statistical and bioinformatics analysis of DNA and RNA sequencing data. Because of their greater measurement complexity, sequencing data are susceptible to many types of technical noise and bias. Our computational aims therefore involve the analysis of raw DNA and RNA sequence data, including single-cell RNA-seq data, in order to discover genomic alterations, including somatic mutations, copy number changes and fusion events, and to estimate isoform-level expression. The key objectives of the computational steps are to reduce noise and bias, and to summarize the data before the downstream analysis that correlates the molecular data with clinical phenotypes.

The problem has become even more challenging with the growth of single-cell RNA sequencing. Single-cell data are sparser than standard bulk-cell or whole-tissue data, so existing methods need to be re-evaluated for their effectiveness in handling single-cell RNA-seq data.