Transposable element annotation
Dnr:

NAISS 2025/22-1638

Type:

NAISS Small Compute

Principal Investigator:

Zhao-Yang Chen

Affiliation:

Umeå universitet

Start Date:

2025-12-12

End Date:

2027-01-01

Primary Classification:

10799: Other Natural Sciences

Webpage:

Allocation

Abstract

Transposable elements (TEs) constitute a substantial portion of plant genomes, often comprising 50-85% of genomic content in many species. Despite their prevalence, TEs remain poorly annotated in most plant genomes, limiting our understanding of genome evolution, gene regulation, and adaptation mechanisms. Comprehensive TE annotation across multiple plant species is essential for several critical reasons: (1) TEs play pivotal roles in genome size variation, chromosomal rearrangements, and species diversification; (2) TE insertions near genes can significantly impact gene expression and create novel regulatory networks; (3) comparative TE analysis across species reveals evolutionary patterns and lineage-specific amplification events; (4) accurate TE annotation is fundamental for improving gene prediction accuracy and genome assembly quality; and (5) understanding TE dynamics is crucial for crop improvement and breeding programs.

This project aims to perform systematic de novo TE annotation across multiple plant species using EDTA (Extensive de novo TE Annotator), a state-of-the-art comprehensive pipeline. EDTA integrates multiple specialized tools including LTR_retriever, TIR-Learner, HelitronScanner, and TEsorter to identify and classify all major TE categories: LTR retrotransposons, TIR transposons, Helitrons, and non-autonomous elements (MITEs and SINEs). The pipeline employs both structure-based and homology-based approaches to ensure high-quality annotations.

Given the computational intensity of genome-scale TE annotation, this project requires substantial computing resources. Multi-species genome analysis demands significant storage capacity to accommodate: (1) raw genome assemblies ranging from 200 MB to several GB per species; (2) intermediate files generated during TE discovery processes; (3) multiple TE libraries and annotation databases; and (4) final curated TE libraries and GFF3 annotation files.

Additionally, the computational workflow is CPU-intensive, requiring parallel processing capabilities for: (1) BLAST similarity searches against TE databases; (2) multiple sequence alignments and clustering operations; (3) machine learning-based TE classification; and (4) iterative refinement and validation steps. Each species typically requires 50-200 CPU hours depending on genome size and complexity.

Expected Outcomes:

Through this project, we anticipate achieving the following deliverables:

Comprehensive TE Database: Complete de novo TE annotation for multiple plant species, generating high-quality, curated TE libraries with detailed structural and classification information. This database will serve as a valuable resource for the plant genomics community.

Functional Genomics Foundation: Provide robust datasets that will facilitate downstream functional genomics research, including gene expression analysis, epigenetic studies, and genome evolution investigations. The annotated TEs will enable researchers to better understand gene regulation mechanisms and identify TE-derived regulatory elements.

This project will significantly advance our understanding of plant genome architecture and evolution while establishing essential computational infrastructure for future comparative genomics studies.
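
The per-species figure of 50-200 CPU hours quoted in the abstract can be turned into a rough project-level budget. The Python sketch below is illustrative only: the abstract does not state the species count or the thread count, so the values used here (10 species, 16 threads) are hypothetical, and the wall-clock estimate assumes ideal parallel scaling, which real annotation runs are unlikely to reach.

```python
def cpu_hour_budget(n_species, low_per_species=50, high_per_species=200):
    """Return (low, high) total CPU hours for n_species genomes.

    The 50-200 CPU hours per species range comes from the abstract;
    n_species is supplied by the caller and is not stated in the source.
    """
    if n_species < 0:
        raise ValueError("n_species must be non-negative")
    return n_species * low_per_species, n_species * high_per_species


def wall_clock_hours(cpu_hours, threads):
    """Idealized wall-clock time, assuming perfect scaling across threads."""
    if threads <= 0:
        raise ValueError("threads must be positive")
    return cpu_hours / threads


if __name__ == "__main__":
    # Hypothetical inputs: 10 species, 16 threads per job.
    low, high = cpu_hour_budget(10)
    print(f"Total budget: {low}-{high} CPU hours")
    print(f"Largest single genome: about {wall_clock_hours(200, 16)} h on 16 threads")
```

Under these assumed inputs the total falls between 500 and 2000 CPU hours; substituting the actual number of genomes gives the corresponding range for the project.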