Two tasks remain unfinished from our 2022-2023 project: i) investigating mitochondrial heteroplasmy, and ii) realigning RNA sequencing catalogues to UU_Cfam_GSD_1.0/canFam4. During the 2022-2023 project we generated population-level variation data (single-base and structural) for ~2,000 domestic and wild canids relative to the domestic dog reference constructed at Uppsala University, UU_Cfam_GSD_1.0/canFam4 [1]. In August 2023 we published the variation catalogue from the nuclear genome [2] and the consensus haplotypes from the mitochondria. Each of these resources was also used to investigate morphology, disease susceptibility, and genome architecture and function. However, two analysis tasks could not be completed during the life of the Uppmax project.
1. Heteroplasmic analysis of the mitochondrial genome
We processed the ~2,000 mitochondrial genomes following the gnomAD mtDNA pipeline [3], using a modified mitochondrial reference genome, the BWA-MEM aligner, and GATK Mutect2 for variant calling. What remains is to finish the QC phase of the pipeline: filtering for contamination and for false positives caused by misalignment. At present there are no large-scale resources for mitochondrial heteroplasmy in dogs, even though the catalogue of mitochondrial diseases is growing (e.g. summarised in [4]). The result will be a resource that the community can use for clinical and other canine research questions.
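As a rough illustration of the remaining QC step, the sketch below classifies a single mitochondrial call by its alternate allele fraction. It is not the gnomAD pipeline or our production code: the `MtCall` record, the `classify` function, and all three thresholds are hypothetical placeholders chosen for the example, and real filtering would also use contamination estimates and alignment-based artefact flags.

```python
from dataclasses import dataclass

# Illustrative thresholds only (assumptions, not the project's settings).
MIN_HET = 0.10   # below this fraction, a call is treated as possible noise
MIN_HOM = 0.95   # at or above this fraction, a call is treated as homoplasmic
MIN_DEPTH = 100  # hypothetical minimum read depth required at the site


@dataclass
class MtCall:
    """Hypothetical record for one alternate allele in one sample."""
    sample: str
    pos: int
    ref_depth: int
    alt_depth: int


def classify(call: MtCall) -> str:
    """Label a call by depth and alternate allele fraction."""
    depth = call.ref_depth + call.alt_depth
    if depth < MIN_DEPTH:
        return "low_depth"
    allele_fraction = call.alt_depth / depth
    if allele_fraction < MIN_HET:
        return "below_threshold"
    if allele_fraction >= MIN_HOM:
        return "homoplasmic"
    return "heteroplasmic"


if __name__ == "__main__":
    example = MtCall(sample="dogA", pos=1234, ref_depth=700, alt_depth=300)
    print(example.sample, example.pos, classify(example))
```

Keeping the thresholds as named constants makes it straightforward to rerun the classification once cut-offs suited to canine data have been decided.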
2. RNA-seq annotation of UU_Cfam_GSD_1.0/canFam4
Due to time constraints, the 2022-2023 Dog10K analyses used gene models from the NCBI annotation to evaluate variant effects. We began examining variant impact with our in-house long-read annotation [1] but were not able to complete this. We will now use both our in-house and recent public RNA-seq data, such as BarkBase [5] and EpiC Dog [6], to improve the NCBI annotation by providing additional transcript information for coding and non-coding genes.
The VCF files generated by these analyses will be available to the whole genomics community without embargo. Variation data will be hosted at UCSC and at additional sites (e.g. ENA). The result will be a panel of normal variation, freely available for the community to use. Our motivation for this project is focused on health and medicine, both for the dog itself and for the dog as a model of human health. We expect these data to rapidly aid the translation of variant association to causation through their use in variant elimination and the dissection of gene tolerance to variation.
References:
[1] Wang https://doi.org/10.1038/s42003-021-01698-x
[2] Meadows https://doi.org/10.1186/s13059-023-03023-7
[3] Laricchia https://doi.org/10.1101/gr.276013.121
[4] Tkaczyk-Wlizło https://doi.org/10.1016/j.mito.2022.02.001
[5] Megquier https://doi.org/10.3390/genes10060433
[6] Son https://doi.org/10.1126/sciadv.ade3399