TAIGA
Dnr:

NAISS 2026/3-42

Type:

NAISS Medium

Principal Investigator:

Jian-Feng Mao

Affiliation:

Umeå universitet

Start Date:

2026-01-29

End Date:

2027-02-01

Primary Classification:

10203: Bioinformatics (Computational Biology) (Applications at 10610)

Secondary Classification:

10610: Bioinformatics and Computational Biology (Methods development to be 10203)

Tertiary Classification:

40401: Plant Biotechnology

Abstract

TAIGA addresses two fundamental challenges in modern plant biology research: the integration bottleneck and the limitations of LLM-based querying systems. Traditional approaches require researchers to manually navigate 23+ disconnected databases (TAIR, Araport11, UniProt, AraCyc, ChIP-Hub, AGRIS, etc.), a process that is time-consuming and error-prone. Meanwhile, using LLMs alone to query biological databases suffers from critical problems: hallucinations (generating non-existent facts), lack of grounding in actual data, inability to access real-time database information, and generation of incorrect or unexecutable queries.

This project develops an agentic graph application: a unified Neo4j knowledge graph that integrates fragmented Arabidopsis thaliana data into a single queryable database, combined with autonomous AI agents that explore the graph intelligently. The platform currently contains 2.1 million nodes and 15 million relationships, representing genes, proteins, pathways, regulatory networks, expression data, phenotypes, and metabolic information from 20+ authoritative data sources. The agentic approach enables autonomous exploration, iterative query refinement, and multi-step reasoning across the knowledge graph.

The platform features three core components: (1) Graph Storage - a Neo4j database with comprehensive data integration from TAIR, Araport11, ChIP-Hub, AGRIS, Expression Atlas, and other sources; (2) Network Science Analysis - advanced graph algorithms for centrality measures, community detection, path analysis, and gene discovery; (3) Agentic Graph Exploration - autonomous AI agents that use LLM API services to generate queries, validate them against the graph schema, execute them, analyze results, and iteratively refine their exploration strategy based on discovered patterns.
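The generate, validate, execute, and refine cycle described for the exploration agents can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's implementation: the function `explore`, the `Attempt` record, and the three stub callables are hypothetical names, and the stubs stand in for an LLM API call, a schema check, and a Neo4j query respectively.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Attempt:
    """One round of the loop: the query tried and what came of it."""
    query: str
    error: Optional[str] = None
    rows: Optional[list] = None


def explore(
    question: str,
    generate: Callable[[str, List[Attempt]], str],
    validate: Callable[[str], Optional[str]],
    execute: Callable[[str], list],
    max_rounds: int = 3,
) -> List[Attempt]:
    """Generate a query, validate it, execute it, and retry on failure.

    Earlier attempts (including their errors) are passed back to the
    generator so that it can refine the next query.
    """
    history: List[Attempt] = []
    for _ in range(max_rounds):
        attempt = Attempt(query=generate(question, history))
        history.append(attempt)
        problem = validate(attempt.query)
        if problem is not None:          # rejected before touching the database
            attempt.error = problem
            continue
        try:
            attempt.rows = execute(attempt.query)
        except Exception as exc:         # execution failure feeds the next round
            attempt.error = str(exc)
            continue
        if attempt.rows:                 # stop once a query returns data
            break
    return history


# Hypothetical stand-ins for the LLM, the schema check, and the database.
def fake_generate(question, history):
    if history and history[-1].error:
        return "MATCH (g:Gene)-[:REGULATES]->(t:Gene) RETURN t.id"
    return "MATCH (g:Genes)-[:REGULATES]->(t:Gene) RETURN t.id"


def fake_validate(query):
    return "unknown label Genes" if ":Genes" in query else None


def fake_execute(query):
    return [{"t.id": "AT1G01010"}]


history = explore("Which genes does X regulate?",
                  fake_generate, fake_validate, fake_execute)
for attempt in history:
    print(attempt)
```

In this toy run the first query is rejected by the validator, the error is fed back to the generator, and the second query is executed. Passing the three steps in as callables keeps the loop itself independent of any particular LLM provider or database driver.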
Key advantages of the agentic approach include: (1) Autonomous multi-hop reasoning - agents can explore complex paths across genes, proteins, pathways, and phenotypes without manual query construction; (2) Self-correcting queries - agents validate and refine queries based on execution results and schema feedback; (3) Iterative exploration - agents can follow interesting patterns discovered in initial queries, enabling serendipitous knowledge discovery; (4) Context-aware reasoning - agents maintain conversation context and build upon previous discoveries; (5) Error recovery - agents can detect query failures and automatically generate alternative approaches.

Key computational workflows include Neo4j graph database operations for complex Cypher queries across millions of nodes, and LLM API services for agent decision-making and query generation (with graph validation to ensure query correctness).

The platform addresses the LLM hallucination problem by: (1) validating all generated Cypher queries against the graph schema before execution; (2) retrieving answers directly from Neo4j database queries rather than from LLM text generation; (3) grounding agent reasoning in 69,554 validated PubMed abstracts via retrieval-augmented generation (RAG); (4) confirming that entities exist in the graph before generating questions about them; (5) enabling agents to verify their own discoveries through follow-up queries. This agentic approach provides the natural language interface and autonomous exploration benefits of AI while maintaining scientific accuracy through database-grounded retrieval.

The platform serves the plant biology research community by providing an autonomous interface to explore gene interactions, regulatory networks, metabolic pathways, and phenotypic associations. The system is designed for both interactive web-based exploration and programmatic API access, supporting diverse research workflows from single-gene analysis to large-scale network studies.
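As a rough illustration of validating a generated Cypher query against the graph schema before execution, the sketch below extracts node labels and relationship types with regular expressions and compares them with an allow-list. Everything specific here is an assumption made for the example: the label and relationship names in the schema sets are hypothetical and are not taken from the TAIGA graph, and the regex extraction is deliberately simplistic (it ignores multi-label nodes, `|` alternatives, and string literals). A production check would more likely rely on a proper Cypher parser or on Neo4j's `EXPLAIN`.

```python
import re

# Hypothetical schema: these names are illustrative, not TAIGA's actual schema.
SCHEMA_LABELS = {"Gene", "Protein", "Pathway", "Phenotype"}
SCHEMA_REL_TYPES = {"REGULATES", "ENCODES", "PARTICIPATES_IN"}

# "(g:Gene" or "(:Gene" -> captures the first label of a node pattern.
NODE_LABEL = re.compile(r"\(\s*\w*\s*:\s*(\w+)")
# "[r:ENCODES" or "[:ENCODES" -> captures the relationship type.
REL_TYPE = re.compile(r"\[\s*\w*\s*:\s*(\w+)")


def schema_errors(cypher: str) -> list:
    """Return a list of schema violations; an empty list means the query passed."""
    errors = []
    for label in NODE_LABEL.findall(cypher):
        if label not in SCHEMA_LABELS:
            errors.append(f"unknown node label: {label}")
    for rel in REL_TYPE.findall(cypher):
        if rel not in SCHEMA_REL_TYPES:
            errors.append(f"unknown relationship type: {rel}")
    return errors


good = "MATCH (g:Gene)-[:ENCODES]->(p:Protein) RETURN p"
bad = "MATCH (g:Gene)-[:BINDS]->(m:Metabolite) RETURN m"

print(schema_errors(good))
print(schema_errors(bad))
```

Returning the violations as messages, rather than a bare pass/fail flag, gives an agent something concrete to feed back into its next query attempt.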