Storage for MR-RATE Brain/Spine MRI Dataset for Foundation Model Research
Dnr: NAISS 2026/4-800
Type: NAISS Small
Principal Investigator: Filippo Ruffini
Affiliation: Umeå universitet
Start Date: 2026-04-27
End Date: 2027-05-01
Primary Classification: 10210: Artificial Intelligence

Abstract

This project requests 5 TB of storage on NAISS infrastructure in support of DIGILUNG (AI-boosted DIGital twIn for personaLized lUNG cancer care), a research project funded by the Kempe Foundation and led by Prof. Paolo Soda at Umeå University, Department of Diagnostics and Intervention. DIGILUNG develops AI-driven digital twins for personalized non-small-cell lung cancer (NSCLC) care, targeting three interconnected objectives: generative virtual scanning through cross-modality image synthesis, multimodal AI for treatment outcome prediction and prognosis, and explainable and trustworthy data interfaces between the physical patient and the virtual twin.

The requested storage will host a curated subset of MR-RATE, a large-scale multimodal neuroimaging dataset comprising 705,254 brain and spine MRI volumes from 83,425 unique patients, paired with anonymized radiology reports and acquisition metadata. This dataset constitutes the principal training and evaluation resource for two research directions central to DIGILUNG.

The first and primary direction targets the development of a unified generative model for cross-sequence MRI synthesis. In patients with stage IV NSCLC, brain, spine, and liver MRI acquisitions are routinely performed to characterize metastatic disease; however, complete multi-sequence coverage, spanning T1-weighted, T2-weighted, FLAIR, SWI, and MRA protocols, is rarely available in clinical practice due to acquisition constraints, scanner availability, and patient compliance. A generative model capable of synthesizing missing MRI sequences from available ones would directly enrich the virtual scanning component of the DIGILUNG digital twin, enabling clinicians to inspect complementary tissue contrasts without additional acquisitions, thereby reducing patient burden and radiation-associated procedures.
Training such a model requires a dataset with sufficient scale, sequence diversity, and paired study-level organization, conditions that MR-RATE uniquely satisfies among publicly available neuroimaging resources. The model architecture will build on recent advances in conditional diffusion models and unpaired image-to-image translation, extending them to the multi-sequence setting where any subset of sequences may be absent at inference time.

The second direction targets automated structured report generation from MRI volumes. Within the DIGILUNG digital twin framework, the ability to generate clinically structured radiology reports conditioned on imaging inputs, organised into findings and impression sections, represents a key step toward closing the loop between virtual scanning and clinical decision support. MR-RATE's paired volume-report structure, with reports reorganised into standardised sections via a validated LLM-based pipeline, provides the supervision signal required to train and benchmark vision-language models (VLMs) for this task. Experiments will specifically address robustness to missing sequences and shortcut learning behavior, extending our group's ongoing work on VLM evaluation in radiology.

Both research directions contribute to DIGILUNG's broader ambition of producing a patient-specific digital twin that integrates imaging, clinical, and genomic data to support individualized treatment planning and prognosis in NSCLC.

The PhD researcher carrying out this work is Filippo Ruffini (filippo.ruffini@umu.se), affiliated with Umeå University. The DIGILUNG principal investigator and main supervisor is Prof. Paolo Soda (paolo.soda@umu.se), Umeå University, Department of Diagnostics and Intervention.