Traditional OCR-dependent methods struggle with the challenges of historical
handwriting and layout variation, which hampers the development of a more general and
flexible model that can be trained downstream on documents with different layouts and
from different time periods. We aim to remedy this problem by training
the so-called Donut model on 67,345 images from the 1910 Swedish Census. Donut is an
OCR-free Document Understanding Transformer trained on Chinese, Korean, and English
receipts. The model is computationally cheaper than traditional OCR methods and
more adaptable to new document types and languages. It converts document
images into a structured JSON format (Kim et al. 2022). Hence, by leveraging the capabilities
of Donut we are convinced that we will be able to create a baseline model that can be trained
downstream to transcribe different types of historical Swedish documents that include
tabulated data into a structured format.
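
To illustrate what a structured format means here, the following is a minimal sketch of how a tagged output sequence of the kind Donut generates could be turned into JSON. It is a simplified stand-in, not the conversion code shipped with the model. The field names (person, name, birth_year, occupation) and the sample row are hypothetical and are not taken from our census data.

```python
import json
import re

# Matches one field of the form <s_key>...</s_key>; the backreference
# ensures that the closing tag carries the same key as the opening tag.
FIELD = re.compile(r"<s_(?P<key>[^>]+)>(?P<body>.*?)</s_(?P=key)>", re.DOTALL)


def tokens_to_dict(sequence: str):
    """Convert a tagged sequence into nested dicts (simplified sketch)."""
    fields = list(FIELD.finditer(sequence))
    if not fields:
        # No tags left: this is a leaf value.
        return sequence.strip()
    return {m.group("key"): tokens_to_dict(m.group("body")) for m in fields}


# Hypothetical model output for a single census row (illustrative only).
row = (
    "<s_person><s_name>Anna Andersson</s_name>"
    "<s_birth_year>1884</s_birth_year>"
    "<s_occupation>Piga</s_occupation></s_person>"
)
print(json.dumps(tokens_to_dict(row), ensure_ascii=False, indent=2))
```

Because the column names live in the tags rather than in the model architecture, a different census layout would only require a different set of field tags in the training targets.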
In our preliminary tests, we observed that the model quickly adapted to different column
mappings and page configurations. It also reached a cross-entropy loss as low as 0.02, and
was able to make correct predictions about the contents of different columns most of the time.
Nevertheless, some problems remain. The model has difficulty distinguishing between words
that are acronyms or contain spelling errors. Thus far, we have used our own
limited computational resources to train the model. To finish the training within a
reasonable time frame, however, we need more computational resources.
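
To give a sense of scale for the cross-entropy loss of 0.02 mentioned above, the sketch below converts a loss value into the geometric-mean probability assigned to the correct token. It assumes that the loss is the mean negative log-likelihood per token in natural-log units, which is a common convention but is not stated explicitly here. The function name is hypothetical.

```python
import math


def mean_token_probability(cross_entropy: float) -> float:
    """Geometric-mean probability of the correct token.

    Assumes the loss is a per-token mean negative log-likelihood
    measured in nats (natural logarithm).
    """
    return math.exp(-cross_entropy)


print(round(mean_token_probability(0.02), 4))  # 0.9802
```

Under that assumption, the model assigns on average about 98 percent probability to the correct next token. This is a per-token quantity and does not translate directly into accuracy for whole fields or rows.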
We are convinced that the finalization of this model will be of general interest to scholars
within the social sciences and humanities. It would represent a fundamental step in attaining
the historical data necessary to answer questions that have long eluded
scholars. To evaluate the applications of the model, we plan to use it to digitize the
1920 and 1940 censuses.