SUPR
Multilingual Healthcare Benchmarking for Swedish
Dnr:

NAISS 2025/22-1162

Type:

NAISS Small Compute

Principal Investigator:

Seyed Alireza Molavi

Affiliation:

Högskolan i Halmstad

Start Date:

2025-08-31

End Date:

2026-09-01

Primary Classification:

10208: Natural Language Processing

Webpage:

Allocation

Abstract

This project aims to develop the first Nordic-focused rubric-based benchmark for evaluating Large Language Models (LLMs) in clinical reasoning, with a primary focus on Swedish and Norwegian medical practice. Unlike existing medical QA datasets that emphasize factual correctness, our benchmark evaluates reasoning quality, safety, and communication using structured rubrics aligned with local clinical guidelines. The core innovation is a semi-automated pipeline for rubric generation that combines LLMs, medical textbooks, and expert-in-the-loop validation, minimizing expert burden while ensuring clinical validity. This pipeline will also enable dataset augmentation by adding structured reasoning steps to existing Swedish medical QA datasets (e.g., SFAM, MedQA-SWE).

The methodology involves: (1) Designing a standardized rubric schema emphasizing behavioral criteria; (2) Bootstrapping rubric generation via supervised fine-tuning on English datasets (e.g., HealthBench), then adapting to Nordic practice through iterative refinement; (3) Aligning rubric quality using reinforcement learning from AI feedback and preference signals; (4) Evaluating LLMs on reasoning validity, factual accuracy, and communication ethics using a rubric-guided evaluation framework.

Expected outcomes include a public benchmark with cases, answers, and validated rubrics; a semi-automated rubric generation system; and baseline evaluations for state-of-the-art LLMs. This work addresses a critical gap in multilingual, clinically grounded evaluation of LLMs and contributes to safer, more reliable AI in healthcare for the Nordic region.
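To make the rubric-guided evaluation concrete, a minimal sketch of a behavioral rubric schema and its scoring follows. The class and field names are illustrative assumptions, not the project's actual schema; the scoring rule (earned points over maximum achievable positive points, clipped to [0, 1]) follows the general scheme used by rubric benchmarks such as HealthBench.

```python
from dataclasses import dataclass

@dataclass
class RubricCriterion:
    """One behavioral criterion, e.g. 'mentions sepsis red flags'.

    Illustrative schema: field names are assumptions for this sketch.
    """
    description: str   # what the model response should (or should not) do
    axis: str          # e.g. "reasoning", "safety", "communication"
    points: int        # positive = desired behavior, negative = unsafe behavior
    met: bool = False  # set by a grader (clinical expert or LLM judge)

def rubric_score(criteria: list[RubricCriterion]) -> float:
    """Score a response against its rubric: points earned on met criteria
    divided by the maximum achievable (sum of positive points), clipped
    to [0, 1] so that met negative criteria can only pull the score down."""
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        return 0.0
    earned = sum(c.points for c in criteria if c.met)
    return max(0.0, min(1.0, earned / max_points))
```

For example, a response that cites the relevant guideline (5 of 10 positive points) but omits an emergency-referral criterion scores 0.5; if it also triggers a negative "recommends unsafe dose" criterion, the penalty subtracts from the earned points before clipping.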