Our research aims to identify regulatory elements in the human genome, with a particular focus on metabolic diseases. We have developed a bioinformatics pipeline that uses raw ChIP-seq data to pinpoint, genome-wide, candidate regulatory elements characterized by allele-specific transcription factor binding or histone modifications. We perform ChIP-seq and RNA-seq experiments on human tissue samples, which provide the starting data for identifying tissue-specific regulatory elements. To determine which heterozygous positions in the genomes of these samples show allele-specific binding, we will perform whole-genome sequencing (WGS) of the human tissue samples (using either Illumina or 10X Genomics technology). We also verify the regulatory activity of the candidate functional elements by examining changes in gene expression profiles using RNA-seq.
Resource Usage
The computing and storage space we are applying for on Bianca will be used to store raw data from WGS of individual samples, ChIP-seq, RNA-seq and other high-throughput sequencing techniques, as well as public data downloaded from major consortia such as the ENCODE project, the Roadmap Epigenomics project and the GTEx project.
Several data analysis pipelines will be run on these data (e.g. the NGI-ChIP-seq pipeline).
Our data are currently divided between two Milou projects: b2010003 (~5 TB) and b2015355 (~2 TB). We are applying here for a SNIC SENS Small project on Bianca with 5000 core-hours/month of computational time and 20 TB of storage.
As mentioned in the Abstract, over the next 1-3 years the project will lead us to sequence and store several NGS datasets, most of them from experiments performed on human tissue samples and therefore requiring storage on Bianca.
Examples of storage needs:
********************************************************
Chromium linked-read WGS (10X Genomics) of 1 liver sample at ~36x coverage: 73 GB
Longranger pipeline (reference alignment and SNP calling): 106 GB
********************************************************
Paired-end sequencing of 1 ChIP experiment for 1 histone modification (2x .fastq): 2.8 GB
NGI-ChIP-seq pipeline output: 20.7 GB
********************************************************
Estimated storage cost for a “work unit” comprising WGS of 1 liver sample, ChIP-seq analysis of 5 histone modifications/TFs in duplicate, pipeline runs and downstream analysis (annotations):
• 73 GB (WGS raw data)
• 106 GB (Longranger pipeline output)
• 2.8 GB x 2 replicates x 5 histone modifications/TFs = 28 GB (ChIP-seq raw data)
• ~5 GB (downstream analysis)
Total: ~210 GB
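The work-unit arithmetic can be reproduced with a short script. The following is only a sketch. The per-item sizes are the figures quoted in this section. The function and constant names, the conversion 1 TB = 1000 GB, and the option of reserving space for data already held on Milou are illustrative assumptions, not part of the application.

```python
# Per-item sizes in GB, taken from the storage examples in this section.
WGS_RAW_GB = 73.0       # Chromium linked-read WGS of 1 liver sample
LONGRANGER_GB = 106.0   # Longranger alignment and SNP calling output
CHIP_FASTQ_GB = 2.8     # paired-end fastq files for 1 ChIP experiment
REPLICATES = 2          # each ChIP experiment is run in duplicate
TARGETS = 5             # histone modifications/TFs per work unit
DOWNSTREAM_GB = 5.0     # approximate size of the remaining ~5 GB item


def work_unit_gb():
    """Storage for one work unit, summing the four listed items."""
    chip_gb = CHIP_FASTQ_GB * REPLICATES * TARGETS
    return WGS_RAW_GB + LONGRANGER_GB + chip_gb + DOWNSTREAM_GB


def units_that_fit(quota_gb, reserved_gb=0.0):
    """Whole work units fitting in quota_gb after reserving reserved_gb.

    reserved_gb is a hypothetical parameter, e.g. for existing data
    that would be moved onto the same allocation.
    """
    return int((quota_gb - reserved_gb) // work_unit_gb())


if __name__ == "__main__":
    quota_gb = 20 * 1000  # assumes 1 TB = 1000 GB
    print(f"One work unit: {work_unit_gb():.0f} GB")
    print(f"Units in an empty quota: {units_that_fit(quota_gb)}")
    # Assumption: the ~7 TB currently on Milou is kept on the allocation.
    print(f"Units with 7 TB reserved: {units_that_fit(quota_gb, 7000)}")
```

The exact sum of the four items is 212 GB, which the text rounds to ~210 GB. Under the stated assumptions, an empty 20 TB allocation holds 94 such work units, or 61 if ~7 TB is reserved for existing data.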
We expect to produce, analyze and store several of these “work units” in the coming years. We plan to maintain a degree of turnover: older data that are no longer actively used at the current stage of the project will be moved to local external hard drives. We therefore consider the 20 TB limit of the SNIC SENS Small project suitable for our needs.