De-novo spruce and pine genome PacBio data assembly
Dnr:

NAISS 2023/5-323

Type:

NAISS Medium Compute

Principal Investigator:

Nicolas Delhomme

Affiliation:

Umeå universitet

Start Date:

2023-07-01

End Date:

2024-07-01

Primary Classification:

40104: Forest Science

Secondary Classification:

10610: Bioinformatics and Systems Biology (methods development to be 10203)

Tertiary Classification:

10203: Bioinformatics (Computational Biology) (applications to be 10610)

Allocation

Abstract

The project SNIC 2022/5-342 has been instrumental for our ongoing KAW-funded project sequencing the Norway spruce and Scots pine genomes, two species of essential ecological and economic importance to Sweden. Our computational needs did not always match our yearly allocation this year (70%) because the project alternates between data generation, data processing and data analysis. For the last four months, we have again been in a computationally intensive phase, coming close to or even exhausting our allocation on Rackham. Spruce and pine are the two economically and ecologically dominant species in Sweden, but adequate genomic resources have so far been lacking because their genomes are large and complex. This compute project is part of the 6-year, 80 MSEK KAW strategic investment program at Umeå Plant Science Centre (UPSC) and Science for Life Laboratory (SciLifeLab) to maintain Swedish conifer research at the absolute international forefront. In previous years, using SNIC resources, we successfully assembled two conifer genomes based purely on long-read sequencing technology. Our Norway spruce assembly has the expected 12 chromosomes and a quality adequate for submission to the Vertebrate Genomes Project. Over the last year, we have used our allocation to annotate the gene space, combining the new Norway spruce assembly with a massive compendium of almost 2000 RNA-Seq samples. This large amount of data has been processed this year, and the work is nearing completion. As we anticipated last year, it is providing us with the most comprehensive view of the gene space to date, both coding and non-coding. It enables us to understand gene regulation much better, down to the transcript level, and makes possible analyses such as differential transcript expression and differential transcript usage, which were until now limited to model organisms.
This is a fantastic achievement and a huge prospect that will provide Sweden and its forestry industry, breeders foremost, with a significant advantage. We will now work on integrating the other data we have generated, including 3D chromatin conformation, chromatin accessibility and histone marks, to shed further light on the gene regulatory and molecular mechanisms that explain the conifer genome size "bulimia", as well as on stress mechanisms (cold and drought) that are highly relevant in a changing climate and key for breeders to develop elite trees to sustain Swedish forestry. In addition to the genomic and annotation efforts, we spent a significant amount of the allocation on the second phase of the project, namely population genomics to uncover the genetic variation underpinning traits of major importance for forestry and breeding, completing it for spruce and proceeding with pine. To conclude, the continuation of project SNIC 2022/5-342 will be instrumental to our ongoing effort, where we will need a large amount of computation for the final spruce data analysis, the pine annotation, and the integration of the spruce and pine results from the population study.