Storage for MR-RATE Brain/Spine MRI Dataset for Foundation Model Research

Dnr:

NAISS 2026/4-1067

Type:

NAISS Small

Principal Investigator:

Filippo Ruffini

Affiliation:

Umeå universitet

Start Date:

2026-06-04

End Date:

2027-07-01

Primary Classification:

10210: Artificial Intelligence

Webpage:

https://xgem.ucbm.org/overview

Allocation

Arrhenius Disk at NAISS: 5000 GiB
Arrhenius GPU at NAISS: 500 GPU-h/month

Abstract

This project requests 500 GPU-hours per month on Arrhenius GPU and 5000 GiB (about 5 TB) of storage on Arrhenius Disk in support of DIGILUNG (AI-boosted DIGital twIn for personaLized lUNG cancer care), a research project funded by the Kempe Foundation and led by Prof. Paolo Soda at Umeå University, Department of Diagnostics and Intervention. DIGILUNG develops AI-driven digital twins for personalized non-small-cell lung cancer (NSCLC) care, targeting three objectives: generative virtual scanning through cross-modality image synthesis, multimodal AI for treatment outcome prediction and prognosis, and explainable data interfaces between the physical patient and the virtual twin.

The core of this allocation is the training of a unified generative model on Arrhenius GPU. The requested Arrhenius Disk storage hosts a curated subset of MR-RATE, a large-scale multimodal neuroimaging dataset of 705,254 brain and spine MRI volumes from 83,425 patients, paired with anonymized radiology reports and acquisition metadata. Training and evaluation jobs read this subset directly during execution on Arrhenius. This is working storage for computation performed on the system: the data is not published, and it is neither served to nor processed on any external infrastructure. The MR-RATE subset is the principal training and evaluation resource for two research directions central to DIGILUNG.

The first and primary direction develops a unified generative model for cross-sequence MRI synthesis. In stage IV NSCLC patients, brain, spine, and liver MRI are routinely performed to characterize metastatic disease, yet complete multi-sequence coverage, spanning T1-weighted, T2-weighted, FLAIR, SWI, and MRA protocols, is rarely available because of acquisition constraints, limited scanner availability, and patient compliance.
A model that synthesizes missing sequences from available ones would enrich the virtual scanning component of the digital twin, letting clinicians inspect complementary tissue contrasts without additional acquisitions and reducing patient burden. Training such a model requires sufficient scale, sequence diversity, and paired study-level organization, conditions that MR-RATE uniquely satisfies among public neuroimaging resources. The architecture will build on conditional diffusion models and unpaired image-to-image translation, extending them to the multi-sequence setting where any subset of sequences may be absent at inference.

The second direction targets automated structured report generation from MRI volumes. Generating clinically structured reports from imaging inputs, organized into findings and impression sections, is a key step toward closing the loop between virtual scanning and clinical decision support. MR-RATE's paired volume-report structure, with reports reorganized into standardized sections via a validated LLM-based pipeline, provides the supervision needed to train and benchmark vision-language models for this task.

Both directions contribute to DIGILUNG's broader ambition of producing a patient-specific digital twin that integrates imaging, clinical, and genomic data to support individualized treatment planning and prognosis in NSCLC.

The PhD researcher carrying out this work is Filippo Ruffini (filippo.ruffini@umu.se), Umeå University. The principal investigator and main supervisor is Prof. Paolo Soda (paolo.soda@umu.se), Umeå University, Department of Diagnostics and Intervention.