ActDisease: Computational Analysis of Historical Medical Periodicals
Dnr: NAISS 2025/22-855
Type: NAISS Small Compute
Principal Investigator: Vera Danilova
Affiliation: Uppsala universitet
Start Date: 2025-07-15
End Date: 2026-02-01
Primary Classification: 10208: Natural Language Processing


Abstract

This project will perform large-scale processing of historical medical periodicals in four languages (English, German, French, and Swedish) published by European patient organizations between 1875 and 1990. This unique dataset was created within the ActDisease project (ERC-2021-STG-101040999) at the Department of History of Science and Ideas, Uppsala University.

Our primary goal is to investigate how patient organizations shaped modern medical practices. We will achieve this by analyzing the evolution of their communicative strategies, identified by classifying the communicative purpose of their published materials over time. This analysis will enable an understanding of how these strategies developed across different countries (Sweden, Germany, France, England) and in relation to various diseases, such as diabetes, allergy, polio, and arthritis.

We have digitized the dataset via Optical Character Recognition (OCR) using ABBYY and conducted initial small-scale experiments in which encoder and generative models classified genre (as an indicator of communicative purpose) from the OCR output. These preliminary studies showed that while classifiers can learn our manually predefined classes of communicative purpose to some extent, their performance is significantly hampered by several factors:

- Limitations of manual class definitions: the manually defined classes may not fully capture the nuanced range of communicative strategies.
- Lack of training data: there is not enough annotated data across all languages and specific subdomains (e.g., particular diseases, publication types) for robust classifier training.
- OCR reading-order errors: OCR output for pages with complex layouts frequently suffers from disrupted reading order, distorting textual meaning.
- Insufficient context: OCR-extracted text alone often lacks the context (e.g., visual elements such as images or tables) needed to determine communicative purpose correctly.

To overcome these limitations, we have experimented with Vision-Language Models (VLMs), applying them directly to page images (e.g., in zero-shot and few-shot settings). This approach offers significant advantages:

- Overcoming OCR issues: direct image processing bypasses errors caused by incorrect text recognition and reading order.
- Richer input: VLMs allow the inclusion of vital visual elements (photographs, tables) in the analysis, leading to a more accurate understanding of communicative purpose.
- Broad pre-trained knowledge: by leveraging the extensive knowledge embedded in their pre-training, VLMs can help identify textual genres and communicative strategies more effectively, which will in turn assist us in refining and expanding our initial class definitions.

Our preliminary tests indicate that processing the entire dataset's page images, even with relatively small publicly available VLMs, is computationally intensive and exceedingly time-consuming with our current resources. To implement our approach effectively and conduct a full-scale analysis of the entire corpus, which comprises around 100,000 periodical pages, we need access to more powerful computational resources.

A critical consideration is that a portion of our dataset is under copyright. This necessitates the local deployment and processing of VLMs, excluding the use of many third-party cloud services and underscoring the need for robust local computational infrastructure.
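
For illustration, the sketch below shows the kind of local, zero-shot page classification we plan to run on the requested resources. It assumes an openly available VLM served through Hugging Face Transformers (Qwen/Qwen2-VL-2B-Instruct is used here only as an example); the label set, prompt, and file path are placeholders rather than our final class definitions.

```python
# Illustrative sketch: zero-shot classification of one periodical page image
# with a small, openly available VLM run locally (no cloud services).
# Model choice, prompt, label set, and path are placeholders for this example.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

MODEL_ID = "Qwen/Qwen2-VL-2B-Instruct"  # example open model; other local VLMs work similarly
LABELS = ["news", "medical advice", "advertisement", "member report", "other"]  # placeholder classes

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def classify_page(image_path: str) -> str:
    """Ask the VLM for a single communicative-purpose label for one page image."""
    image = Image.open(image_path).convert("RGB")
    messages = [{
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text",
             "text": "This is a page from a historical patient-organization periodical. "
                     "Classify its main communicative purpose. "
                     f"Answer with exactly one of: {', '.join(LABELS)}."},
        ],
    }]
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=10)
    # Decode only the newly generated tokens (the answer), not the prompt.
    answer = processor.batch_decode(
        output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )[0]
    return answer.strip()

if __name__ == "__main__":
    print(classify_page("sample_page.png"))  # placeholder path to a scanned page
```

In the full-scale run, this per-page call would be batched over roughly 100,000 page images, which is the workload motivating the requested GPU allocation.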