Historical language models and cipher type detection of historical encrypted manuscripts
Dnr: NAISS 2026/4-47

Type: NAISS Small

Principal Investigator: Beáta Megyesi

Affiliation: Stockholms universitet

Start Date: 2026-02-01

End Date: 2027-02-01

Primary Classification: 10208: Natural Language Processing

Abstract

The aim of the project is to digitize, process, and decrypt historical encrypted sources (ciphers) as well as rare writings, and to provide tools for (semi-)automatic transcription and decryption. We focus on developing software tools for automatic analysis that allow users to decrypt various types of historical sources written in rare writing systems. The process of automatic decipherment comprises four steps: handwritten text recognition, which converts the manuscript images into machine-readable transcribed text and maps the symbols to a transcription scheme; detection of the plaintext language (the underlying or a related language) of the script on the basis of historical text sources; analysis of the source by adapting large language models (LLMs); and finally the decipherment itself. We experiment with various types of deep learning algorithms, including zero-shot and few-shot learning, with and without pre-trained language models, for the transcription and decryption of these rare sources. In 2026 we plan to run decipherment experiments based on images and on transcribed plaintext. We also aim to create historical language models for named entity recognition and for the correction of transcription errors.
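As a rough illustration of what cipher type detection on a transcribed source can look like, the sketch below computes the index of coincidence, a classical statistic that a simple substitution preserves and a polyalphabetic cipher tends to flatten. This is not the project's method, which relies on deep learning; the function names, the sample text, the symbol codes, and the key are all hypothetical and chosen only for the example.

```python
from collections import Counter
from string import ascii_lowercase


def index_of_coincidence(symbols):
    """Probability that two symbols drawn without replacement are identical."""
    n = len(symbols)
    if n < 2:
        raise ValueError("need at least two symbols")
    counts = Counter(symbols)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))


def simple_substitution(text, table):
    """Replace every letter by one fixed symbol code (monoalphabetic cipher)."""
    return [table[ch] for ch in text]


def vigenere(text, key):
    """Shift each lowercase letter by the matching key letter (polyalphabetic cipher)."""
    a = ord("a")
    return "".join(
        chr((ord(ch) - a + ord(key[i % len(key)]) - a) % 26 + a)
        for i, ch in enumerate(text)
    )


# Hypothetical transcription scheme: each letter maps to a distinct code "S00".."S25".
# Multiplying by 7 modulo 26 is a bijection, so no two letters share a code.
SYMBOL_TABLE = {ch: f"S{(i * 7 + 3) % 26:02d}" for i, ch in enumerate(ascii_lowercase)}

# Illustrative plaintext only (letters a-z, no spaces).
SAMPLE = "theaimoftheprojectistodecrypthistoricalencryptedsources"

if __name__ == "__main__":
    print("plaintext    :", index_of_coincidence(SAMPLE))
    print("substitution :", index_of_coincidence(simple_substitution(SAMPLE, SYMBOL_TABLE)))
    print("vigenere     :", index_of_coincidence(vigenere(SAMPLE, "cipher")))
```

Because the statistic depends only on symbol frequencies, it can be computed directly on transcription codes without knowing the plaintext alphabet. In practice it would be at most one input feature among many, and short texts give noisy values.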