De-novo spruce and pine genome annotation and comparative analyses

Dnr:

NAISS 2024/5-440

Type:

NAISS Medium Compute

Principal Investigator:

Nicolas Delhomme

Affiliation:

Umeå universitet

Start Date:

2024-09-04

End Date:

2025-10-01

Primary Classification:

40104: Forest Science

Secondary Classification:

10610: Bioinformatics and Systems Biology (methods development to be 10203)

Tertiary Classification:

10203: Bioinformatics (Computational Biology) (applications to be 10610)

Webpage:

https://www.upsc.se/platforms/upsc-bioinformatics-platform.html

Allocation

Dardel at PDC: 100 x 1000 core-h/month

Abstract

The project SNIC 2023/5-323 has been instrumental for our ongoing KAW-funded project sequencing the Norway spruce and Scots pine genomes, two species of essential ecological and economic importance to Sweden. Our computational needs have not always matched our yearly allocation this year (82%) because the project alternates between data generation, data processing and data analysis, but we came fairly close. We are now again expecting a computationally intensive phase for the ongoing comparative genomics, epigenetics, gene family and GWAS analyses. Spruce and pine are the two economically and ecologically dominant species in Sweden, but adequate genomic resources have so far been lacking because their genomes are large and complex. This compute project is part of the 6-year, 80 MSEK KAW strategic investment program at Umeå Plant Science Centre (UPSC) and Science for Life Laboratory (SciLifeLab) to maintain Swedish conifer research at the absolute international forefront. In previous years, using SNIC resources, we successfully assembled two conifer genomes based purely on long-read sequencing technology. Our Norway spruce assembly has the expected 12 chromosomes and a quality adequate for submission to the Vertebrate Genomes Project. Over the last years we have used our allocation to annotate the gene space of this new Norway spruce assembly, in combination with a massive compendium of almost 2000 RNA-Seq samples. Processing this large amount of data is now complete. As we anticipated last year, it provides the most comprehensive view of the gene space to date, both coding and non-coding. This enables us to understand gene regulation much better, down to the transcript level, and empowers analyses such as differential transcript expression and differential transcript usage that were to date limited to model organisms.
This is a fantastic achievement and a huge prospect that will provide Sweden and its forestry industry, breeders foremost, with a significant advantage. Last year we started integrating the other data we have generated, including 3D chromatin conformation, chromatin accessibility and histone marks, to shed further light on the gene regulatory and molecular mechanisms that explain the conifer genome size "bulimia", as well as on stress mechanisms (cold and drought) that are highly relevant in a changing climate and key for breeders developing elite trees to sustain Swedish forestry. In addition to the annotation and integrative analysis efforts, we spent a significant amount of the allocation on population genomics to uncover the genetic variation underpinning traits of major importance for forestry and breeding, further processing it for pine and starting to compare it with the spruce results from last year. To conclude, the continuation of project SNIC 2023/5-323 will be instrumental to our ongoing effort, for which we will need a large amount of computation for the spruce and pine comparative analysis, including the population study, novel datasets (epigenetics) and comprehensive gene family analyses.