This project will develop and run scalable, reproducible computational workflows to process multi-omics datasets and perform Mendelian randomization (MR) analyses aimed at identifying causal molecular mechanisms underlying metabolic and cognitive diseases. The work combines in-house sequencing data with publicly available resources to generate harmonized, analysis-ready datasets and robust causal inference results that can guide downstream biological validation.
A core component is the preparation of expression quantitative trait loci (eQTL) resources for MR. This includes systematic download, formatting, quality control, and harmonization of eQTL summary statistics (variant identifiers, alleles, effect sizes, standard errors, genome build consistency, and sample metadata). These steps are essential for reliable instrument selection and alignment with outcome genome-wide association study (GWAS) datasets. In parallel, we will process and analyze our own bulk RNA-seq and single-cell RNA-seq datasets to characterize disease-relevant transcriptional programs and cell-type-specific signatures linked to metabolic stress and cognitive phenotypes. Processing will include read-level quality control, alignment or pseudo-alignment, quantification, normalization, batch correction where appropriate, and downstream differential expression and pathway analyses. For single-cell data, we will perform standard workflows such as filtering, integration, clustering, cell-type annotation, marker detection, and signature scoring.
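The allele-alignment step described above can be sketched in a few lines. This is a minimal illustration, assuming summary statistics are already loaded as records with hypothetical keys (`variant`, `effect_allele`, `other_allele`, `beta`, `se`); a production pipeline would additionally use allele frequencies to resolve strand-ambiguous palindromic variants rather than simply dropping them.

```python
def harmonize(exposure, outcome):
    """Align outcome effects to the exposure effect allele.

    Flips the outcome beta when the effect/other alleles are swapped,
    and drops variants whose alleles cannot be reconciled (including
    strand-ambiguous A/T and G/C palindromes).
    """
    palindromes = {frozenset("AT"), frozenset("GC")}
    out_by_id = {rec["variant"]: rec for rec in outcome}
    harmonized = []
    for exp in exposure:
        out = out_by_id.get(exp["variant"])
        if out is None:
            continue  # variant absent from outcome GWAS
        alleles = frozenset((exp["effect_allele"], exp["other_allele"]))
        if alleles in palindromes:
            continue  # ambiguous without allele-frequency checks
        if (out["effect_allele"], out["other_allele"]) == (
            exp["effect_allele"], exp["other_allele"]
        ):
            beta_out = out["beta"]
        elif (out["other_allele"], out["effect_allele"]) == (
            exp["effect_allele"], exp["other_allele"]
        ):
            beta_out = -out["beta"]  # swapped allele coding: flip sign
        else:
            continue  # irreconcilable alleles (e.g. build/strand error)
        harmonized.append({
            "variant": exp["variant"],
            "beta_exp": exp["beta"], "se_exp": exp["se"],
            "beta_out": beta_out, "se_out": out["se"],
        })
    return harmonized
```

In practice this logic is provided by established MR tooling; the sketch only makes explicit why consistent variant identifiers, allele columns, and genome builds are prerequisites for instrument alignment.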
To expand statistical power and enable cross-study validation, we will download and process additional sequencing datasets from GEO and related public repositories, focusing on studies relevant to type 2 diabetes (T2D), metabolic stress/starvation contexts, and cognitive disease phenotypes in experimental models. We will also obtain and harmonize open GWAS summary statistics (including T2D and cognitive traits, where available) to serve as MR outcomes, applying consistent filters and variant harmonization across sources.
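The consistent filters applied across GWAS sources can be illustrated with a short sketch. The thresholds shown (p < 5e-8 for genome-wide significance, minor allele frequency >= 0.01) are conventional defaults rather than project-specific values, and the field names are assumptions about the harmonized record layout.

```python
import math

P_THRESHOLD = 5e-8  # conventional genome-wide significance
MAF_MIN = 0.01      # drop rare variants with unstable estimates

def two_sided_p(beta, se):
    """Two-sided p-value from a Wald z-statistic."""
    z = abs(beta / se)
    return math.erfc(z / math.sqrt(2.0))

def select_instruments(records):
    """Keep common, genome-wide significant variants as MR instruments."""
    kept = []
    for rec in records:
        maf = min(rec["eaf"], 1.0 - rec["eaf"])  # minor allele frequency
        if maf < MAF_MIN:
            continue
        if two_sided_p(rec["beta"], rec["se"]) >= P_THRESHOLD:
            continue
        kept.append(rec)
    return kept
```

Applying the same thresholds to every source, rather than inheriting each study's defaults, is what makes instrument sets comparable across exposure datasets.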
The requested NAISS compute time is primarily needed for large-scale preprocessing, multi-omics integration, and repeated MR analyses with sensitivity and robustness checks (e.g., heterogeneity tests, pleiotropy-robust estimators, leave-one-out analyses). These tasks are computationally intensive due to large input files, repeated harmonization steps across many exposure-outcome pairs, and the need for reproducible containerized pipelines.
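To indicate the shape of the repeated analyses, the core inverse-variance-weighted (IVW) estimate, Cochran's Q heterogeneity statistic, and leave-one-out re-estimation can be sketched as below. This is a pure-Python illustration over per-variant Wald ratios from harmonized effects, not a substitute for dedicated MR packages; note that each of these computations is repeated across every exposure-outcome pair, which is what drives the compute requirement.

```python
import math

def ivw(beta_exp, beta_out, se_out):
    """Fixed-effect IVW estimate over per-variant Wald ratios."""
    ratios = [bo / be for bo, be in zip(beta_out, beta_exp)]
    # First-order (delta-method) SE of each Wald ratio
    ses = [so / abs(be) for so, be in zip(se_out, beta_exp)]
    weights = [1.0 / s ** 2 for s in ses]
    est = sum(w * r for w, r in zip(weights, ratios)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    # Cochran's Q: ~ chi-squared with k-1 df under homogeneity
    q = sum(w * (r - est) ** 2 for w, r in zip(weights, ratios))
    return est, se, q

def leave_one_out(beta_exp, beta_out, se_out):
    """Re-estimate IVW dropping one variant at a time."""
    k = len(beta_exp)
    results = []
    for i in range(k):
        keep = [j for j in range(k) if j != i]
        est, se, _ = ivw(
            [beta_exp[j] for j in keep],
            [beta_out[j] for j in keep],
            [se_out[j] for j in keep],
        )
        results.append((i, est, se))
    return results
```

A large leave-one-out pass multiplies the per-pair cost by the instrument count, and pleiotropy-robust estimators (e.g., MR-Egger, weighted median) add further passes over the same harmonized data.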