Historical language models and cipher type detection of historical encrypted manuscripts
Dnr: NAISS 2023/22-1324
Type: NAISS Small Compute
Principal Investigator: Beáta Megyesi
Affiliation: Stockholms universitet
Start Date: 2024-01-08
End Date: 2025-02-01
Primary Classification: 10208: Language Technology (Computational Linguistics)

Abstract

The aim of the project is to digitize, process, and decrypt historical encrypted sources (ciphers) and to provide tools for (semi-)automatic transcription and decryption. We focus on developing software tools for automatic analysis that allow users to decrypt various types of historical encrypted documents. The automatic decryption process includes handwritten text recognition, which converts the manuscript images into machine-readable transcribed text and maps symbols according to a transcription scheme; detection of the plaintext language (the language underlying the cipher) on the basis of historical text sources; automatic identification of the cipher type; cryptanalysis of the ciphertext; and, finally, its decryption. We experiment with various types of deep learning algorithms, including zero-shot and few-shot learning, with and without pretrained language models, for the transcription and decryption of ciphers. In 2024 we plan to run experiments on cipher type detection based on images and on transcribed plaintext. We also aim to create historical language models for named entity recognition and for the correction of transcription errors.