This project explores the use of Large Language Models (LLMs) to improve Optical Character Recognition (OCR) for historical documents. Traditional OCR systems often struggle with the noisy inputs, non-standard orthography, and degraded source material common in archival texts. We investigate how LLMs can improve transcription accuracy by drawing on contextual linguistic knowledge and domain-specific historical data.
Our approach bridges computational methods and humanities scholarship by combining digitized historical sources with modern LLM-based techniques. We will explore methods in which LLMs act both as post-OCR correction systems and as integrated components of transcription pipelines, resolving ambiguities and reconstructing partially illegible text.
By evaluating performance across a range of historical corpora that vary in script, orthographic convention, and level of degradation, we aim to identify robust strategies for improving OCR quality. The outcomes of this project could make cultural heritage data substantially more accessible and usable, while contributing to ongoing research in NLP, document analysis, and digital humanities.
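
As an illustration of how transcription accuracy might be compared before and after LLM correction, the sketch below computes a character error rate (CER): the Levenshtein edit distance between a reference transcription and an OCR output, divided by the reference length. CER is an assumption here; the project description does not name a specific metric. The function names and the sample strings are hypothetical and chosen only for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    # Keep the shorter string in the inner loop so the row buffer stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the length of the reference transcription."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Invented strings standing in for a ground-truth line, a raw OCR output,
    # and the same line after a hypothetical LLM correction pass.
    reference = "the quick brown fox"
    raw_ocr = "tbe qnick brown f0x"
    corrected = "the quick brown f0x"
    print("CER before correction:", character_error_rate(reference, raw_ocr))
    print("CER after correction:", character_error_rate(reference, corrected))
```

A lower CER after correction would suggest that the LLM pass helped on that line. For historical material, the reference itself has to follow a consistent transcription convention (for example, whether original spellings are preserved or normalized), since that choice changes what counts as an error.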