SUPR
Multiomics analysis of pediatric AML - #1
Dnr:

sens2018102

Type:

SNIC SENS

Principal Investigator:

Linda Holmfeldt

Affiliation:

Uppsala universitet

Start Date:

2018-02-16

End Date:

2024-07-01

Primary Classification:

30203: Cancer and Oncology

Allocation

  • Castor /proj at UPPMAX: 12500 GiB
  • Cygnus /proj at UPPMAX: 12500 GiB
  • Castor /proj/nobackup at UPPMAX: 1500 GiB
  • Cygnus /proj/nobackup at UPPMAX: 1500 GiB
  • Bianca at UPPMAX: 13 x 1000 core-h/month

Abstract

We are currently performing a multilevel analysis of initial-diagnosis, primary refractory, and relapse acute myeloid leukemia (AML) specimens from adult patients, using the resources of UPPMAX (projects sens2017148 (previously b2017040) and sens2017604 (previously b2017041)) as well as various SciLifeLab core facilities. To obtain a more complete picture of the leukemic cells than has previously been generated, we are performing studies at the genome, epigenome, transcriptome, and proteome levels, followed by a systems biology approach for full integration of the data sets.

This SNIC SENS application concerns a corresponding project, for which we are currently awaiting data, that studies specimens from pediatric rather than adult AML patients. We are generating a rather large data set comprising whole genome sequencing (WGS) data, transcriptome sequencing (RNA-seq) data, genome-wide DNA methylation data, and proteome data from high-resolution mass spectrometry. Because the raw data volume is relatively large (estimated to reach approximately 15 TB in total) and the data sets will be analyzed by different people, the project will be split in two; the other part is project “sens2018512”.

Resource Usage: This part of the overall project contains sensitive personal data, including WGS data from primary human leukemic and normal specimens from 153 HiSeq X lanes (paired-end, 150 bp), corresponding to approximately 12 TB of raw data. These data will be delivered as BAM files, and only a limited expansion of the raw data will be needed, based on the following workflow: BAM files (raw data) > quality control > tumor/normal variant calling (SNVs, CNVs, and SVs) > experimental analysis data (e.g. reproduction of newly published methods), reaching a total of approximately 13.5 TB of data, including 1 TB in nobackup.
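The storage figures in the Resource Usage paragraph can be cross-checked with a few lines of arithmetic. The following is a minimal sketch only: the variable names are illustrative, and it assumes decimal units (1 TB = 1000 GB) and that the 1 TB in nobackup is counted within the 13.5 TB total, as the text suggests.

```python
# Figures taken from the abstract; names are illustrative, not part of any tool.
RAW_TB = 12.0        # raw WGS data delivered as BAM files
TOTAL_TB = 13.5      # projected total after analysis
NOBACKUP_TB = 1.0    # portion of the total kept in nobackup (assumed included)
LANES = 153          # HiSeq X lanes

derived_tb = TOTAL_TB - RAW_TB               # data added by the analysis
backed_up_tb = TOTAL_TB - NOBACKUP_TB        # data held in backed-up /proj
expansion_pct = derived_tb / RAW_TB * 100    # growth relative to raw data
gb_per_lane = RAW_TB * 1000 / LANES          # average raw data per lane

print(f"Derived data:      {derived_tb:.1f} TB ({expansion_pct:.1f}% of raw)")
print(f"Backed-up storage: {backed_up_tb:.1f} TB")
print(f"Raw data per lane: {gb_per_lane:.1f} GB")
```

Under these assumptions the backed-up share comes to 12.5 TB, which is in line with the 12500 GiB /proj allocation listed above if GiB and GB are treated as roughly interchangeable.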
Data analysis (variant calling and filtering) will be performed using Strelka, Manta, Control-FREEC, GATK, Picard, Nirvana, and ANNOVAR. These analyses are highly bursty. Initially, we need approximately 1000 core hours per month per TB of data (i.e. approximately 13,000 core hours per month). After the initial burst (lasting approximately 1-2 months), significantly fewer core hours are needed. Approximately 80% of the core hours go to the different types of variant calling.
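The compute estimate can be reproduced as follows. This is a rough sketch under stated assumptions: the function name is hypothetical, and the rate is applied to the projected total of 13.5 TB, which gives a figure slightly above the roughly 13,000 core hours per month quoted in the text.

```python
CORE_HOURS_PER_TB_PER_MONTH = 1000   # rate stated in the abstract
VARIANT_CALLING_PERCENT = 80         # share of core hours stated in the abstract


def burst_core_hours(data_tb: float, months: int = 1) -> float:
    """Estimate core hours needed during the initial burst (hypothetical helper)."""
    return data_tb * CORE_HOURS_PER_TB_PER_MONTH * months


monthly = burst_core_hours(13.5)                          # one burst month
variant_calling = monthly * VARIANT_CALLING_PERCENT / 100
burst_low = burst_core_hours(13.5, months=1)              # burst lasting 1 month
burst_high = burst_core_hours(13.5, months=2)             # burst lasting 2 months

print(f"Per month:          {monthly:,.0f} core hours")
print(f"Variant calling:    {variant_calling:,.0f} core hours per month")
print(f"Whole burst period: {burst_low:,.0f} to {burst_high:,.0f} core hours")
```

With these inputs the monthly need is 13,500 core hours, of which about 10,800 would go to variant calling; the requested allocation of 13 x 1000 core-h/month sits just below that estimate.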