LLM-OCR for Historical Documents: Integrating Humanities Data and Neural Models

Dnr:

NAISS 2026/4-977

Type:

NAISS Small

Principal Investigator:

Mats Fridlund

Affiliation:

Göteborgs universitet

Start Date:

2026-06-01

End Date:

2027-06-01

Primary Classification:

10208: Natural Language Processing

Allocation

Arrhenius GPU at NAISS: 500 GPU-h/month
Arrhenius Disk at NAISS: 250 GiB

Abstract

This project explores the use of Large Language Models (LLMs) to improve Optical Character Recognition (OCR) for historical documents. Traditional OCR systems often struggle with the noisy inputs, non-standard orthography, and degraded source material common in archival texts. We will investigate how LLMs can improve transcription accuracy by drawing on contextual linguistic knowledge and domain-specific historical data. Our approach bridges computational methods and humanities scholarship by combining digitized historical sources with modern LLM-based techniques. We will explore methods in which LLMs act both as post-OCR correction systems and as integrated components of transcription pipelines, capable of resolving ambiguities and reconstructing partially illegible text. By evaluating performance across a range of historical corpora that vary in script, orthographic conventions, and level of degradation, we aim to identify robust strategies for improving OCR quality. The outcomes could substantially improve the accessibility and usability of cultural heritage data while contributing to ongoing research in NLP, document analysis, and digital humanities.
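
The abstract does not say how OCR quality will be measured. One common choice in OCR evaluation is the character error rate (CER): the edit distance between a reference transcription and the system output, divided by the reference length. The sketch below is only an illustration under that assumption. The function names `levenshtein` and `cer` and the sample strings are hypothetical and are not taken from the project.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    # Keep the shorter string in the inner loop to limit memory use.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical sample data: a ground-truth line, a noisy OCR reading,
# and an imagined post-correction output.
reference = "the quick brown fox"
raw_ocr = "tbe qu1ck brovvn fox"
corrected = "the quick brown fox"

print(f"CER before correction: {cer(reference, raw_ocr):.3f}")
print(f"CER after correction:  {cer(reference, corrected):.3f}")
```

Comparing CER before and after a correction step gives a simple per-line measure of whether post-OCR correction helped. A real evaluation would also need to decide how to normalize historical spelling variants before scoring.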