Training an OCR-free Visual Document Understanding Model to Automatically Transcribe Swedish Historical Census Data


Dnr:

NAISS 2024/5-492

Type:

NAISS Medium Compute

Principal Investigator:

Jakob Molinder

Affiliation:

Uppsala universitet

Start Date:

2024-09-27

End Date:

2025-10-01

Primary Classification:

50203: Economic History

Secondary Classification:

50901: Social Sciences Interdisciplinary

Tertiary Classification:

60101: History

Allocation

Rackham at UPPMAX: 88 x 1000 core-h/month
Snowy at UPPMAX: 88 x 1000 core-h/month

Abstract

Traditional OCR-dependent methods struggle with the challenges of historical handwriting and layout variation. This has hampered the development of a more generalized and flexible model that can be trained downstream on documents with different layouts and from different time periods. We aim to remedy this problem by training the so-called Donut model on 67,345 images from the 1910 Swedish Census. Donut is an OCR-free Document Understanding Transformer trained on Chinese, Korean, and English receipts. The model is computationally cheaper than traditional OCR methods and more adaptable to new document types and languages. It converts document images into a structured JSON format (Kim et al. 2022). By leveraging the capabilities of Donut, we are convinced that we can create a baseline model that can be trained downstream to transcribe different types of historical Swedish documents containing tabulated data into a structured format. In our preliminary tests, the model quickly adapted to different column mappings and page configurations. It also reached a cross-entropy loss as low as 0.02 and correctly predicted the contents of different columns most of the time. Nevertheless, some problems remain. The model has difficulty distinguishing between words that are acronyms or contain spelling errors. Thus far, we have trained the model on our own limited computational resources, but finishing the training within a reasonable time frame requires more. We are convinced that the finished model will be of general interest to scholars in the social sciences and humanities. It would represent a fundamental step toward obtaining the historical data needed to answer questions that have long eluded scholars.
To evaluate the applications of the model, we plan to use it to digitize the 1920 and 1940 censuses.
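
As a rough aid to reading the loss figure reported in the abstract, the sketch below converts a mean per-token cross-entropy into the geometric-mean probability the model assigns to the correct token, and into the probability it assigns to a whole reference sequence. It rests on assumptions that the abstract does not state: that the loss is averaged per token and measured in nats (natural logarithm), and that a transcribed record spans about 50 tokens, a purely hypothetical length chosen for illustration.

```python
import math


def mean_correct_token_probability(cross_entropy_nats: float) -> float:
    """Geometric-mean probability of the correct token.

    Assumes the loss is a per-token mean measured in nats.
    """
    return math.exp(-cross_entropy_nats)


def reference_sequence_probability(cross_entropy_nats: float, num_tokens: int) -> float:
    """Probability the model assigns to an entire reference sequence.

    A mean loss L over n tokens corresponds to a total negative
    log-likelihood of L * n, hence a sequence probability of exp(-L * n).
    """
    return math.exp(-cross_entropy_nats * num_tokens)


if __name__ == "__main__":
    loss = 0.02            # figure reported in the abstract
    tokens_per_record = 50  # hypothetical length, not taken from the project

    print(f"per-token: {mean_correct_token_probability(loss):.4f}")
    print(f"per-record: {reference_sequence_probability(loss, tokens_per_record):.4f}")
```

Under these assumptions, a loss of 0.02 corresponds to a per-token probability of roughly 0.98, while the probability of a full 50-token record is only about 0.37. This is one way to see why a low loss can coexist with remaining errors in individual words.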