LTU Document Analysis

Dnr:

NAISS 2026/3-244

Type:

NAISS Medium

Principal Investigator:

Elisa Hope Barney Smith

Affiliation:

Luleå tekniska universitet

Start Date:

2026-04-28

End Date:

2026-11-01

Primary Classification:

10207: Computer graphics and computer vision (System engineering aspects at 20208)

Webpage:

https://www.ltu.se/en/staff/e/elisa-barney-smith

Allocation

Alvis at C3SE: 5000 GPU-h/month
Mimer at C3SE: 2500 GiB
Arrhenius GPU at NAISS: 2000 GPU-h/month
Arrhenius Disk at NAISS: 1250 GiB
Arrhenius Flash at NAISS: 650 GiB

Abstract

The proposal covers several machine learning projects in the field of historical document analysis. We currently envision the following projects over the duration of this application:

Open-set Text Recognition. This project addresses transcribing text from images in which completely unknown characters may appear. The model is expected to flag such occurrences as anomalies to trigger human intervention, and then let the user add recognition capability by registering a template with the model, without re-training or fine-tuning. This technology promotes inclusiveness and enables the recognition of historical and modern scripts with extremely low resources. In this work, we plan to scale this task beyond the scope of text.

Domain Adaptation for Handwritten Text Recognition (HTR). In this research, we propose an alternative approach based on domain adaptation for smaller HTR models. The objective is to adapt previously learned knowledge to new datasets by aligning feature representations across domains. This approach aims to reduce or potentially eliminate the need for extensive labelled training data when deploying HTR models on unseen datasets.

Mixture-of-Experts for HTR. We propose a modular CRNN architecture that incorporates multiple language-specific LSTM blocks, each acting as an expert for a particular language. We plan to adopt a Mixture-of-Experts (MoE) framework, in which a learnable router dynamically selects the most appropriate language expert based on the input features. The router will be trained to predict language probabilities and activate the corresponding LSTM block(s). We further hypothesize that, when presented with previously unseen languages, the MoE framework can still provide a transcription by combining the outputs of multiple experts, weighted by the router's language probabilities.

Line-Segmentation and HTR Trained Together. We propose to jointly train a text line segmentation model and an HTR model in an end-to-end framework. Instead of treating segmentation and recognition as independent stages, we aim to integrate them into a unified architecture by propagating the recognition loss from the HTR model back to the line segmentation model. With feedback from the recognition objective, the segmentation model can learn to produce text line boundaries that are better suited to transcription, rather than strictly adhering to human-defined annotations. The segmentation component thereby becomes recognition-aware, potentially improving overall system performance.

Extended Detection and Recognition of Ancient Egyptian Characters. Based on previous research results, we plan to develop an extended framework for the recognition of ancient Egyptian hieratic characters that provides character candidates from documents, offers few-shot recognition and novelty detection, and enables the integration of human expertise to restructure and correct class definitions.
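
As an illustration of the Mixture-of-Experts idea in the abstract, the following minimal sketch shows how a router's language probabilities could weight the outputs of several language experts for a single time step. It is not the project's implementation: the linear router, the function names (`route`, `combine_experts`), the `top_k` option, and the assumption that all experts emit probabilities over one shared character set are hypothetical choices made only for this example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(features, router_weights):
    """Hypothetical linear router: one logit per language expert."""
    logits = [sum(w * f for w, f in zip(row, features)) for row in router_weights]
    return softmax(logits)


def combine_experts(router_probs, expert_outputs, top_k=None):
    """Mix per-expert character distributions using router probabilities.

    expert_outputs[i] is expert i's distribution over a shared character
    set (an assumption of this sketch). If top_k is given, only the top_k
    most probable experts are used and their weights are renormalised.
    """
    indices = list(range(len(router_probs)))
    if top_k is not None:
        indices = sorted(indices, key=lambda i: router_probs[i], reverse=True)[:top_k]
    norm = sum(router_probs[i] for i in indices)
    n_classes = len(expert_outputs[0])
    mixed = [0.0] * n_classes
    for i in indices:
        weight = router_probs[i] / norm
        for c in range(n_classes):
            mixed[c] += weight * expert_outputs[i][c]
    return mixed


# Toy example: two experts, two-character alphabet, made-up numbers.
features = [1.0, 0.0]
router_weights = [[2.0, 0.0], [0.0, 2.0]]
expert_outputs = [[0.9, 0.1], [0.2, 0.8]]

probs = route(features, router_weights)
mixed_all = combine_experts(probs, expert_outputs)           # soft mixture of all experts
mixed_top1 = combine_experts(probs, expert_outputs, top_k=1)  # hard selection of one expert
```

With `top_k=1` the sketch reduces to selecting a single language expert, while leaving `top_k` unset corresponds to the soft combination hypothesized for previously unseen languages.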