Cell-free DNA (cfDNA) sequencing from blood samples is revolutionising many areas of medicine, from cancer detection to transplant monitoring. It is likely to become a primary tool for tackling treatment resistance in cancer because, for the first time, it allows disease progression to be monitored in near real time. However, a crucial limitation of the method is that most of the sequenced cfDNA comes from healthy cells, and often only a very small percentage originates from tumour cells carrying the genetic alterations that distinguish them from healthy cells. If the tumour contains different types of cells (e.g. sensitive and resistant to therapy), each type contributes an even smaller percentage of the sequenced DNA. Our ability to track how the tumour changes over time, in its overall amount and/or its composition, is therefore strongly hindered by the low amount of signal in typical cfDNA data.
Somatic copy number alterations (CNAs, e.g. losses or gains of whole chromosomes) are widespread in cancers and are well suited to tracking tumour composition because (i) CNAs are typically exclusive to tumour cells (whereas mutations can also be found in healthy cells), and (ii) they can be evaluated using remarkably cheap low-pass whole-genome sequencing (lpWGS). However, when the proportion of tumour-derived DNA is low, it becomes challenging to clearly distinguish CNAs (i.e. regions with a significantly different amount of genetic material) from random variation in the total number of sequencing reads mapping to a genomic location.
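To make the scale of the problem concrete, the following minimal sketch estimates how far a single-copy gain shifts the expected read depth of a genomic bin, and compares that shift with counting noise. It rests on simplifying assumptions that are not part of the project description: healthy cells are diploid, the depth baseline is that of a fully diploid sample, and read counts per bin follow a Poisson distribution. The function names and the value of 1000 reads per bin are purely illustrative.

```python
import math


def expected_depth_ratio(tumour_fraction, tumour_copies, normal_copies=2):
    """Expected read depth of a bin relative to a diploid baseline.

    Assumes the sample is a simple mixture of healthy (diploid) DNA and
    tumour DNA carrying `tumour_copies` copies of the region.
    """
    mixed = (1 - tumour_fraction) * normal_copies + tumour_fraction * tumour_copies
    return mixed / normal_copies


def signal_to_noise(tumour_fraction, tumour_copies, reads_per_bin):
    """Shift in read count divided by the Poisson standard deviation.

    `reads_per_bin` is the expected count in a copy-neutral bin; the
    Poisson assumption is an idealisation of real sequencing noise.
    """
    ratio = expected_depth_ratio(tumour_fraction, tumour_copies)
    shift = abs(ratio - 1) * reads_per_bin
    return shift / math.sqrt(reads_per_bin)


# Single-copy gain (3 copies) at decreasing tumour fractions.
for f in (0.5, 0.1, 0.01):
    ratio = expected_depth_ratio(f, 3)
    snr = signal_to_noise(f, 3, reads_per_bin=1000)
    print(f"tumour fraction {f:>4}: depth ratio {ratio:.3f}, per-bin SNR {snr:.2f}")
```

Under these assumptions the depth shift scales linearly with the tumour fraction, while the counting noise does not shrink with it, so at low tumour fractions the per-bin shift falls below the noise level.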
The aim of this project is to help enhance the signal-to-noise ratio of CNA data by characterising and subtracting the measurement noise that accompanies liquid biopsy and low-pass sequencing. To this end, we developed a denoising method based on convolutional neural networks that reconstructs the underlying CNA profile from noise-ridden low-purity data. The method will be tested through an automated bioinformatic pipeline applied to realistic simulated data and gold-standard cancer cell line data.
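As a rough illustration of the idea of convolutional denoising, the sketch below simulates a noisy low-purity depth profile with one gained segment and passes it through a single one-dimensional convolution with a fixed averaging kernel. This is not the project's network: a convolutional neural network would learn its kernel weights from data and stack several such layers with non-linearities. The bin count, segment position, tumour fraction, Gaussian noise level and all function names are hypothetical choices made for the example.

```python
import random


def simulate_profile(n_bins=200, gain_start=80, gain_end=140,
                     tumour_fraction=0.05, noise_sd=0.03, seed=0):
    """Return (clean, noisy) depth ratios for a profile with one gained segment.

    Assumes a diploid background and Gaussian measurement noise, both
    simplifications chosen for illustration.
    """
    rng = random.Random(seed)
    gained_ratio = 1 + tumour_fraction / 2  # one extra copy in the tumour cells
    clean = [gained_ratio if gain_start <= i < gain_end else 1.0
             for i in range(n_bins)]
    noisy = [c + rng.gauss(0, noise_sd) for c in clean]
    return clean, noisy


def conv1d_same(signal, kernel):
    """Slide an odd-length kernel over the signal, keeping the input length.

    Edges are padded by repeating the first and last values. As in most
    CNN libraries the kernel is not flipped, which makes no difference
    for the symmetric kernel used here.
    """
    half = len(kernel) // 2
    padded = [signal[0]] * half + list(signal) + [signal[-1]] * half
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(signal))]


def mean_squared_error(a, b):
    """Average squared difference between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


clean, noisy = simulate_profile()
kernel = [1 / 15] * 15  # fixed averaging filter standing in for learned weights
denoised = conv1d_same(noisy, kernel)

print("MSE before filtering:", mean_squared_error(noisy, clean))
print("MSE after filtering: ", mean_squared_error(denoised, clean))
```

A fixed averaging kernel reduces noise within segments but blurs the segment boundaries; learning the filters from simulated pairs of clean and noisy profiles is one way a network could be trained to do better on both counts.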