Breast cancer profiling of primary tumours based on cancer-associated long noncoding RNAs (lncRNAs) and splice variants




Principal Investigator:

Johan Hartman


Karolinska Institutet

Start Date:


End Date:


Primary Classification:

30203: Cancer and Oncology


  • Castor /proj/nobackup at UPPMAX: 10000 GiB
  • Cygnus /proj/nobackup at UPPMAX: 10000 GiB
  • Castor /proj at UPPMAX: 5000 GiB
  • Cygnus /proj at UPPMAX: 5000 GiB
  • Bianca at UPPMAX: 5 x 1000 core-h/month


ClinSeq is an initiative for stratifying cancer patients using genomic profiling and for implementing novel genomics-based diagnostics in clinical care. The data in ClinSeq consist of low-pass DNA-seq, RNA-seq, and targeted sequencing of a pan-cancer panel of 484 genes from more than 300 patients. In this sub-project of ClinSeq, we aim to profile primary breast cancer tumors based on lncRNAs and splice variants. Specifically, the 353 available paired-end RNA-seq samples will be pre-processed using the NGI-RNAseq bioinformatics best-practice analysis pipeline or a tailored adaptation of it. Possible adaptations include using different transcript model collections, e.g. GENCODE, miTranscriptome (Iyer et al., Nature Genetics, 2015), and/or the recently published atlas of human lncRNAs with accurate 5’ ends (Hon et al., Nature, 2017), and/or using another method for transcript-level quantification, e.g. RSEM (Li and Dewey, BMC Bioinformatics, 2011), which has been shown to slightly outperform other methods (Teng et al., Genome Biology, 2016).

Resources: The computing time will initially be used for customizing the pipeline to the project’s specific needs and, as a next step, for running the RNA-seq pre-processing pipeline. Briefly, the raw reads/fragments will be trimmed, aligned, and quantified at both the gene and the transcript level, and extensive quality-control steps will be performed using the basic and adapted instances of the NGI-RNAseq pipeline. More specifically, in addition to the tools in the basic NGI-RNAseq pipeline (FastQC, TrimGalore, STAR, RSeQC, dupRadar, Preseq, featureCounts, StringTie, edgeR, and MultiQC), we will use RSEM and other R/Bioconductor packages. We will also run the NGI-RNAseq pipeline with different annotations, so multiple sets of outputs will be generated. Later on, computing time will be used for differential expression (DE) analysis, clustering, exploratory analysis and visualization, and/or other necessary analysis tasks.
A pilot run of the NGI-RNAseq pipeline on a single paired-end RNA-seq sample (2 x ~2 GB) produced ~20 GB of output files (i.e. a total of ~24 GB per sample, including the raw reads). The total number of RNA-seq samples is 353, so the estimated total size of the data for one run of the basic NGI-RNAseq pipeline is ~8.5 TB. Since we also plan to run adapted versions of the NGI-RNAseq pipeline, an estimated size of 50 TB for the ‘nobackup’ folder should be enough (all intermediate files will be stored in the ‘nobackup’ folder and removed later on). For the project folder, a minimum size of 10 TB should be enough to hold the raw FASTQ files, other files needed for the analysis (e.g. several annotation files), selected intermediate files, and results. The pilot run, using the NGI-RNAseq default UPPMAX configuration, consumed ~100 core hours and completed in ~20 hours on Milou. Thus, for all 353 samples, an allocation of at least 50 x 1,000 core hours per month is kindly requested. We would also like to note that, since the project uses clinical samples from breast cancer patients, the Bianca cluster is the best option for storing the data and running the project pipeline because of its enhanced security features.
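
The storage and compute estimates above can be reproduced with a short calculation. The sketch below uses only the approximate pilot-run figures quoted in the text and decimal units (1 TB = 1000 GB). The number of pipeline variants (`N_PIPELINE_VARIANTS`) is not stated in the proposal and is a purely illustrative assumption, as are all function and variable names.

```python
# Approximate figures quoted in the text (single-sample pilot run).
N_SAMPLES = 353
RAW_GB_PER_SAMPLE = 2 * 2       # paired-end reads: two files of ~2 GB each
OUTPUT_GB_PER_SAMPLE = 20       # files produced by one pipeline run
CORE_HOURS_PER_SAMPLE = 100     # core hours used by the pilot run

# Hypothetical value for illustration only: the text does not say how many
# adapted pipeline variants (annotations, quantifiers) will be run.
N_PIPELINE_VARIANTS = 4


def storage_tb(n_samples, raw_gb, output_gb, n_runs=1):
    """Estimated storage in TB: raw reads stored once, outputs once per run."""
    total_gb = n_samples * (raw_gb + output_gb * n_runs)
    return total_gb / 1000


def core_hours(n_samples, hours_per_sample, n_runs=1):
    """Estimated core hours, assuming every run costs the same as the pilot."""
    return n_samples * hours_per_sample * n_runs


# One run of the basic pipeline over all samples.
single_run_tb = storage_tb(N_SAMPLES, RAW_GB_PER_SAMPLE, OUTPUT_GB_PER_SAMPLE)
single_run_core_hours = core_hours(N_SAMPLES, CORE_HOURS_PER_SAMPLE)

# Several variants, under the hypothetical N_PIPELINE_VARIANTS above.
multi_run_tb = storage_tb(
    N_SAMPLES, RAW_GB_PER_SAMPLE, OUTPUT_GB_PER_SAMPLE, N_PIPELINE_VARIANTS
)
multi_run_core_hours = core_hours(
    N_SAMPLES, CORE_HOURS_PER_SAMPLE, N_PIPELINE_VARIANTS
)

print(f"Single run: {single_run_tb:.1f} TB, {single_run_core_hours} core hours")
print(f"{N_PIPELINE_VARIANTS} variants: {multi_run_tb:.1f} TB, "
      f"{multi_run_core_hours} core hours")
```

For a single run this gives roughly 8.5 TB and 35,300 core hours, consistent with the figures in the text. The multi-variant numbers depend entirely on the assumed number of variants and on each variant having a footprint similar to the pilot run.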