Modern single-cell and multiomic datasets require complex, multistep analysis
workflows involving sparse matrices, trajectory models, differential expression,
signature scoring, artifact management, and scientific reporting. At the same
time, large language models make it possible to build interactive analysis
agents, but unconstrained agents that execute arbitrary code are difficult to
audit and unsuitable for sensitive or unpublished biological data.
This project will develop and evaluate an agent-orchestrated analysis system for
single-cell and multiomic trajectory analysis. The system separates a canonical
analysis engine, scworkbench, from a Python LLM orchestration harness,
scwb-agent. The analysis engine exposes typed scientific operations over
portable analysis bundles, while the agent harness is restricted to typed tools
and records full provenance for every action.
The separation between the canonical analysis engine and the LLM-based
orchestration layer is strict. The scworkbench engine, already developed by the
project PI, De Weerd, provides portable analysis bundles, typed scientific
operations, promoted artifacts, and provenance records. The scwb-agent harness,
also already in development, exposes only typed tools to the language model,
records every tool call and result, and writes reports linked to engine
provenance. The agent cannot run arbitrary shell commands or free-form R code.
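
The typed-tool restriction described above can be illustrated with a minimal
Python sketch. It is not the scwb-agent implementation: the names `ToolSpec`,
`Harness`, and `score_signature`, the record fields, and the SHA-256 digest are
all hypothetical and chosen only for illustration. The sketch shows the general
pattern in which the model may call only registered tools with checked argument
types, and every call, including rejected ones, is appended to a provenance log.

```python
from __future__ import annotations

import hashlib
import json
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass(frozen=True)
class ToolSpec:
    """Hypothetical typed tool: a name, expected argument types, and a handler."""
    name: str
    arg_types: dict[str, type]
    handler: Callable[..., Any]


@dataclass
class Harness:
    """Hypothetical harness: dispatches registered tools and logs every call."""
    tools: dict[str, ToolSpec] = field(default_factory=dict)
    provenance: list[dict[str, Any]] = field(default_factory=list)

    def register(self, spec: ToolSpec) -> None:
        self.tools[spec.name] = spec

    def _record(self, tool: str, args: dict[str, Any], status: str, result: Any) -> None:
        # Store the call plus a content digest so later edits to the log are detectable.
        entry = {"tool": tool, "args": args, "status": status, "result": result}
        payload = json.dumps(entry, sort_keys=True, default=str).encode()
        entry["digest"] = hashlib.sha256(payload).hexdigest()
        self.provenance.append(entry)

    def call(self, name: str, /, **args: Any) -> Any:
        spec = self.tools.get(name)
        if spec is None:
            # Anything not registered (e.g. a shell command) is refused and logged.
            self._record(name, args, "rejected", "unknown tool")
            raise PermissionError(f"tool not registered: {name}")
        if set(args) != set(spec.arg_types):
            self._record(name, args, "rejected", "argument names do not match")
            raise TypeError(f"{name} expects arguments {sorted(spec.arg_types)}")
        for key, expected in spec.arg_types.items():
            if not isinstance(args[key], expected):
                self._record(name, args, "rejected", f"bad type for {key}")
                raise TypeError(f"{name}: {key} must be {expected.__name__}")
        try:
            result = spec.handler(**args)
        except Exception as exc:
            # Failures inside the tool are logged too, then re-raised.
            self._record(name, args, "error", repr(exc))
            raise
        self._record(name, args, "ok", result)
        return result


def score_signature(bundle: str, genes: list) -> dict:
    # Placeholder for an engine operation; a real tool would call the engine.
    return {"bundle": bundle, "n_genes": len(genes)}


harness = Harness()
harness.register(
    ToolSpec("score_signature", {"bundle": str, "genes": list}, score_signature)
)
result = harness.call("score_signature", bundle="demo.bundle", genes=["GATA1", "TAL1"])
```

In this sketch, a request such as `harness.call("run_shell", cmd="ls")` raises
`PermissionError` and still leaves a `rejected` entry in `harness.provenance`,
which is the property that makes the agent's behavior auditable.
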
The main computational tasks are:
- large-scale single-cell and multiomic analysis over portable scworkbench
bundles
- repeated operation runs for differential expression, signature scoring,
trajectory inference, smoothing, model projection, motif analysis, and
benchmarking
- local inference with open-weight instruction-tuned LLMs for planning,
tool selection, error recovery, and report drafting
- evaluation of agent behavior under constrained typed-tool access
- containerized reproducibility tests across CPU, GPU, and sensitive-data
partitions
The project does not aim to pretrain a foundation model. The LLM workload is
primarily inference, possibly with lightweight parameter-efficient adaptation or
preference tuning of existing open models. The biological workloads are
substantial because they operate on large sparse matrices, feature spaces,
trajectory objects, model artifacts, and repeated analysis DAGs.