SUPR
Synthesizing training data for NER in the clinical domain
Dnr:

NAISS 2024/22-965

Type:

NAISS Small Compute

Principal Investigator:

Thomas Vakili

Affiliation:

Stockholms universitet

Start Date:

2024-08-22

End Date:

2025-03-01

Primary Classification:

10208: Language Technology (Computational Linguistics)

Webpage:

Abstract

Large language models (LLMs) are vulnerable to attacks that cause them to leak information about their training data. In many applications, e.g., in the clinical domain, these risks are unacceptable. In response, several privacy-preserving techniques have been proposed. One promising technique is to do away with the sensitive training data altogether and instead train on synthesized data. In this project, we will fine-tune several state-of-the-art LLMs to synthesize training data for named entity recognition (NER) in Spanish, English, and Swedish. We will use monolingual and multilingual models of different sizes. The aim is to provide insights into how well this privacy-preserving technique works and what resources are required to implement it.
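To make the approach concrete, below is a minimal sketch of the synthesize-then-train idea: a causal LM is fine-tuned on NER-annotated sentences serialized as tagged text, and new annotated sentences are then sampled from it to stand in for the sensitive originals. This is not the project's actual pipeline; the model name ("gpt2"), the XML-style tag format, the example sentences, and all hyperparameters are illustrative assumptions.

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import Dataset

# Hypothetical annotated corpus: entities are wrapped in XML-style tags so the
# LM learns to emit annotations together with the surrounding text.
annotated = [
    "The patient was given <DRUG>metformin</DRUG> for <DIAGNOSIS>type 2 diabetes</DIAGNOSIS>.",
    "An MRI at <HOSPITAL>St. Göran</HOSPITAL> showed no <DIAGNOSIS>lesions</DIAGNOSIS>.",
]

model_name = "gpt2"  # stand-in; the project uses several mono- and multilingual LLMs
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

dataset = Dataset.from_dict({"text": annotated}).map(tokenize, remove_columns=["text"])

# Standard causal-LM fine-tuning on the tagged sentences.
trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="ner-synth",
        num_train_epochs=3,
        per_device_train_batch_size=2,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

# Sample a synthetic, annotated sentence; in the full setup, such samples
# replace the sensitive data when training a downstream NER model.
inputs = tokenizer("The patient", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=60, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

In practice the generated text would still need to be parsed back into token-level NER labels and filtered for well-formed tags; the sketch only shows the fine-tune-and-sample core of the technique.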