SUPR
Restoring Documentary Greek papyri with ML
Dnr: NAISS 2024/22-201

Type: NAISS Small Compute

Principal Investigator: Eric Cullhed

Affiliation: Uppsala universitet

Start Date: 2024-02-12

End Date: 2025-03-01

Primary Classification: 60202: Specific Languages

Abstract

Much of ancient literature survives in a deteriorated state: corrupted by repeated copying over thousands of years, or physically damaged where the texts were written or inscribed on papyri, stone tablets, pottery shards and so on. The task of restoring the lost segments of text has traditionally relied on the linguistic intuitions of skilled Classical philologists. Machine learning models can now be trained on all extant Greek literature to suggest how the gaps may be filled, aiding philological research (see esp. Assael, Sommerschield, and Prag 2022). This project will fine-tune an open-source T5 seq2seq model pretrained on ca. 110 million words of ancient Greek (https://huggingface.co/bowphs/GreTa; also the multilingual bowphs/PhilTa, which uses the same Greek training data) on the task of filling in gaps in a corpus of 4 million words of ancient papyri. The dataset is ready and consists of pairs of masked and unmasked samples (up to 512 tokens), with the masking carefully simulating the ways in which papyri are usually damaged.
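
As a rough illustration of what a masked/unmasked training pair might look like, the sketch below cuts random character-level gaps into a line of Greek and writes them in the sentinel style (`<extra_id_0>`, `<extra_id_1>`, ...) used in T5 span corruption. It is only a sketch under stated assumptions: the abstract does not describe the project's actual damage simulation or its input format, and the function names `make_gap_pair` and `fill_gaps`, the gap sizes, and the sample sentence are all hypothetical.

```python
import random
import re

SENTINEL = re.compile(r"<extra_id_\d+>")

def make_gap_pair(text, n_gaps=2, max_gap=6, seed=0):
    """Return (masked, target) with up to n_gaps random gaps cut out of text.

    Assumption: gaps are marked with T5-style sentinel tokens; the real
    project may encode lacunae differently.
    """
    if not text:
        return text, ""
    rng = random.Random(seed)
    spans = []
    for _ in range(100):  # bounded number of placement attempts
        if len(spans) >= n_gaps:
            break
        length = rng.randint(1, min(max_gap, len(text)))
        start = rng.randint(0, len(text) - length)
        end = start + length
        # Reject spans that overlap or touch an existing gap.
        if all(end < s or start > e for s, e in spans):
            spans.append((start, end))
    spans.sort()

    masked, target, pos = [], [], 0
    for i, (s, e) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        masked += [text[pos:s], sentinel]   # visible text, then the gap marker
        target += [sentinel, text[s:e]]     # gap marker, then the lost text
        pos = e
    masked.append(text[pos:])
    return "".join(masked), "".join(target)

def fill_gaps(masked, target):
    """Reinsert the fills from target into masked (inverse of make_gap_pair)."""
    fills = SENTINEL.split(target)[1:]      # text following each sentinel
    parts = SENTINEL.split(masked)
    out = [parts[0]]
    for fill, rest in zip(fills, parts[1:]):
        out += [fill, rest]
    return "".join(out)

# Illustrative sample only; not taken from the project's papyri corpus.
sample = "ἐν ἀρχῇ ἦν ὁ λόγος καὶ ὁ λόγος ἦν πρὸς τὸν θεόν"
masked, target = make_gap_pair(sample, n_gaps=2, seed=1)
print(masked)
print(target)
```

A fine-tuned seq2seq model would receive the masked string as input and be trained to emit the target string; `fill_gaps` shows how a prediction in that format could be merged back into the damaged text.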