SUPR
Development of a Generalized Scoring Tool for Pathogenic Rare Variant Detection in Non-Coding RNA Genes and UNICORN Regions
Dnr:

NAISS 2025/22-498

Type:

NAISS Small Compute

Principal Investigator:

Daniel Nilsson

Affiliation:

Karolinska Institutet

Start Date:

2025-03-24

End Date:

2026-04-01

Primary Classification:

10610: Bioinformatics and Computational Biology (Methods development to be 10203)

Abstract

Around 40-50% of all rare genetic diseases can today be explained by variants in protein-coding regions of the genome, yet roughly half of the cases still lack a genetic explanation. The protein-coding parts of the genome are estimated to make up around 2% of the genome. At the same time, around 70-90% of the genome is transcribed into RNA, so the next step for research on rare genetic diseases is to evaluate variants found in the non-coding parts of the genome. Known functional regions outside the protein-coding sequence include regulatory elements such as non-coding RNA genes, many of which regulate protein-coding genes. Among these there are some 20 known disease-associated genes. The problem in interpreting variants in the non-protein-coding parts today is that most research has focused directly on protein-coding sequence, so methods for evaluating pathogenicity in other parts of the genome are lacking. There is also no scoring framework that covers many non-coding RNA genes together; existing work consists of independent projects that focus primarily on single genes. ACMG, a widely recognized medical genetics organization, has issued recommendations on how to interpret the non-coding parts, but there is little computational framework for applying such interpretation to automated ranking and scoring and, ultimately, to large-scale use in healthcare. This project will use currently available methods such as conservation scores, disease associations and functional prediction to create a generalized scoring tool for research use in evaluating potential pathogenicity in parts of the genome other than the protein-coding ones. The focus will be on non-coding RNA genes such as lncRNA and on regions of the genome that are conserved throughout mammals.
This will help reveal the current gaps in what is needed to assess pathogenicity and rank variants in other parts of the genome, and will produce a base from which further work can be done to integrate this into clinical use. To achieve this goal, a number of open-source tools and resources will be used, for example databases of variant frequencies such as gnomAD, conservation scores, computational prediction methods for assessing variants, and criteria from professional organizations for interpreting pathogenicity. These will then be compiled into a program that automates the process and ranks variants from a given genome. Data will be processed via standard bioinformatic variant-calling pipelines, but initially the work will primarily use already processed Genome in a Bottle genomes. All development and testing will be done with such publicly available, non-sensitive data.
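As a rough illustration of how a tool of this kind might combine the evidence sources named above (population frequency, conservation, functional prediction and disease association) into a single rank, the following Python sketch scores a handful of variants with a weighted sum. It is not the project's method: the `Variant` fields, the weights and the 1% allele-frequency cutoff are all assumptions chosen for illustration, and every input score is assumed to have been rescaled to the range 0 to 1 beforehand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """One annotated variant; all fields are hypothetical placeholders."""
    variant_id: str
    allele_frequency: float       # population frequency (0..1), e.g. taken from gnomAD
    conservation: float           # conservation score, assumed rescaled to 0..1
    functional_prediction: float  # in-silico prediction score, assumed rescaled to 0..1
    in_disease_gene: bool         # lies in a known disease-associated non-coding RNA gene


# Illustrative weights (assumption); they sum to 1 so the final score stays in 0..1.
WEIGHTS = {"rarity": 0.3, "conservation": 0.3, "prediction": 0.3, "disease_gene": 0.1}

# Illustrative cutoff (assumption): variants at or above this frequency get no rarity credit.
RARE_AF_CUTOFF = 0.01


def rarity(allele_frequency: float, cutoff: float = RARE_AF_CUTOFF) -> float:
    """Return 1.0 for an unobserved variant, falling linearly to 0.0 at the cutoff."""
    if allele_frequency >= cutoff:
        return 0.0
    return 1.0 - allele_frequency / cutoff


def score(variant: Variant) -> float:
    """Weighted sum of the four evidence components."""
    return (
        WEIGHTS["rarity"] * rarity(variant.allele_frequency)
        + WEIGHTS["conservation"] * variant.conservation
        + WEIGHTS["prediction"] * variant.functional_prediction
        + WEIGHTS["disease_gene"] * (1.0 if variant.in_disease_gene else 0.0)
    )


def rank(variants: list[Variant]) -> list[Variant]:
    """Return the variants ordered from highest to lowest score."""
    return sorted(variants, key=score, reverse=True)
```

Calling `rank()` on a list of `Variant` objects returns them with the highest-scoring candidates first. A weighted sum is only one possible design; a rule-based scheme that follows the professional criteria more closely would be an equally plausible starting point.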