Training LLMs for Historical NER
Dnr: NAISS 2026/4-379
Type: NAISS Small
Principal Investigator: Crina Tudor
Affiliation: Stockholms universitet
Start Date: 2026-02-24
End Date: 2027-03-01
Primary Classification: 10208: Natural Language Processing
Webpage:

Abstract

Fine-tuning BERT-based models on historical data for NER is inherently compute-intensive because the challenges of historical text cannot be addressed through lightweight adaptation alone. Historical corpora diverge substantially from modern training data, owing to OCR noise, orthographic variation, diachronic language change, and domain-specific entity usage, which requires continued pretraining and controlled fine-tuning over large token volumes to meaningfully adapt the underlying representations. These processes rely on iterative gradient updates over long sequences and large parameter spaces, making GPU acceleration essential for both feasibility and scientific rigor. Without sufficient GPU hours, experiments would be limited to shallow or unstable adaptations, undermining reproducibility, preventing systematic ablation across languages and time periods, and ultimately weakening empirical claims about model robustness in low-resource historical settings. Adequate GPU resources are therefore not a matter of convenience but a prerequisite for methodologically sound research on historically grounded NER models.

The main goal of this research is to investigate how continued pretraining of a historical multilingual BERT model enables effective domain adaptation for named entity recognition across languages, time periods, and varying levels of annotated data availability. This work is conducted under the supervision of Beáta Megyesi (main supervisor) and Robert Östling (co-supervisor).
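As a concrete illustration of the two-stage adaptation described above, the following minimal sketch shows continued masked-language-model pretraining on historical text followed by token-classification fine-tuning for NER, assuming the Hugging Face transformers and datasets libraries. The base checkpoint, corpus path, tag set, and hyperparameters are illustrative placeholders, not the project's actual configuration.

    # Sketch of continued pretraining + NER fine-tuning; all names below
    # (checkpoint, file path, label set, hyperparameters) are hypothetical.
    from datasets import load_dataset
    from transformers import (
        AutoModelForMaskedLM,
        AutoModelForTokenClassification,
        AutoTokenizer,
        DataCollatorForLanguageModeling,
        Trainer,
        TrainingArguments,
    )

    BASE_MODEL = "bert-base-multilingual-cased"  # placeholder starting checkpoint
    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)

    # Stage 1: continued pretraining (masked language modeling) on historical text.
    raw = load_dataset("text", data_files={"train": "historical_corpus.txt"})  # placeholder path

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=512)

    lm_data = raw["train"].map(tokenize, batched=True, remove_columns=["text"])
    collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

    mlm_model = AutoModelForMaskedLM.from_pretrained(BASE_MODEL)
    Trainer(
        model=mlm_model,
        args=TrainingArguments(
            output_dir="ckpt-dapt",
            num_train_epochs=1,
            per_device_train_batch_size=16,
            fp16=True,  # mixed precision; one reason GPU hours dominate the budget
        ),
        train_dataset=lm_data,
        data_collator=collator,
    ).train()
    mlm_model.save_pretrained("ckpt-dapt")
    tokenizer.save_pretrained("ckpt-dapt")

    # Stage 2: fine-tune the domain-adapted encoder for NER (token classification).
    labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]  # example tag set
    ner_model = AutoModelForTokenClassification.from_pretrained(
        "ckpt-dapt", num_labels=len(labels)
    )
    # ...align word-level NER tags to subword tokens, then train with a Trainer as above.

Repeating the second stage across languages, time periods, and annotated-data budgets, while holding the stage-1 checkpoint fixed, is one natural way to structure the ablations the abstract refers to.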