My PhD project seeks to clarify the dynamics behind the long-run development process, seen through the lens of gender equality, by focusing on the role of women inventors. Very little is known about how much women patented during the phase of industrialization that led to the economic development of Western economies. A fundamental part of the project has been to create and improve a database of almost 400,000 patents, essentially all patents of invention taken out in France during the 19th century.
An important next step, needed to learn more about the individuals themselves (that is, beyond the patent level of analysis), is to accurately identify and deduplicate individuals in the archival patent records. I have done so for all women inventors of this period using both machine learning (ML) and manual work. For their male counterparts, however, the number of record pairs to be evaluated exceeds 1.6 million. This part of the project, for which I need the NAISS resources, addresses the problem by developing and evaluating advanced ML techniques for entity resolution on a large dataset of male inventors.
A novel component of this research involves leveraging Vision-Language Models (VLMs), such as Moondream2 and Qwen-VL variants, to extract and structure critical information directly from scanned handwritten historical documents. This VLM-driven data augmentation will enrich inventor profiles, particularly where existing metadata is sparse, thereby improving the accuracy of the downstream entity resolution task.
The core computational work involves two main AI/ML thrusts:
1. VLM-based Information Extraction: Implementing and running inference with VLMs (e.g., Moondream2 2B, Qwen-VL 7B-32B) to process a corpus of scanned documents, extracting key textual details relevant to inventor identification. The scope of this step will be limited and will depend on the resource allocation. I run more than one model because one of my best strategies for handling hallucinations and data contamination from models is to use agreement between model outputs as a safeguard before accepting an output.
2. Deep Learning Entity Resolution: Adapting and extending architectures similar to the Ditto framework (Li et al., 2023). This involves Siamese networks that combine text embeddings (from sentence-transformers and VLM-extracted text) with structured features.
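As an illustration of the agreement check described in item 1, a minimal Python sketch could look as follows. The model keys, the example name, and the helpers `normalize` and `accept_by_agreement` are hypothetical and not part of the existing pipeline; the normalization rules and the two-vote threshold are assumptions chosen for the example.

```python
from collections import Counter
from typing import Dict, Optional
import re
import unicodedata


def normalize(value: str) -> str:
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", value)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = re.sub(r"[^\w\s]", " ", no_accents.lower())
    return " ".join(cleaned.split())


def accept_by_agreement(outputs: Dict[str, str], min_votes: int = 2) -> Optional[str]:
    """Return the normalized value that at least `min_votes` models agree on.

    `outputs` maps a (hypothetical) model label to the text that model
    extracted for one field. Returns None when no value has enough support.
    """
    normalized = [normalize(v) for v in outputs.values() if v]
    votes = Counter(n for n in normalized if n)  # ignore empty extractions
    if not votes:
        return None
    value, count = votes.most_common(1)[0]
    return value if count >= min_votes else None


# Hypothetical extractions of one inventor name from three models.
example = {
    "model_a": "Lefèvre, Marie",
    "model_b": "lefevre marie",
    "model_c": "Lefebvre Marie",
}
print(accept_by_agreement(example))  # lefevre marie
```

Values rejected by such a check would fall back to manual review or be left empty rather than entering the inventor profile.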
The Alvis HPC resource is critical for these computationally intensive tasks: VLM inference on image data and extensive experimentation to develop the entity resolution model, including hyperparameter optimization (Optuna), k-fold cross-validation, iterative feature engineering, and training on approximately 100,000 labeled pairs (from the women inventor data). The trained models will then perform inference on roughly 1.6 million unlabeled male inventor record pairs.
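To make the evaluation loop concrete, the following is a toy sketch of k-fold cross-validation in plain Python. It is not the planned setup: the "model" here is only a decision threshold on a precomputed pair-similarity score, and a small grid of candidate thresholds stands in for the Optuna search. All function names and the example data are hypothetical.

```python
import random
from statistics import mean


def k_fold_indices(n_items, k, seed=0):
    """Shuffle indices 0..n_items-1 and split them into k nearly equal folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def f1_score(labels, predictions):
    """F1 for binary labels; defined as 0.0 when there are no true positives."""
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    fp = sum(1 for y, p in zip(labels, predictions) if not y and p)
    fn = sum(1 for y, p in zip(labels, predictions) if y and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def fit_threshold(scores, labels, candidates):
    """'Train' the toy model: pick the candidate threshold with the best F1."""
    return max(candidates, key=lambda t: f1_score(labels, [s >= t for s in scores]))


def cross_validate(scores, labels, candidates, k=5, seed=0):
    """Fit on k-1 folds, score on the held-out fold, return the mean F1."""
    held_out_f1 = []
    for held_out in k_fold_indices(len(scores), k, seed):
        held = set(held_out)
        train = [i for i in range(len(scores)) if i not in held]
        threshold = fit_threshold(
            [scores[i] for i in train], [labels[i] for i in train], candidates
        )
        predictions = [scores[i] >= threshold for i in held_out]
        held_out_f1.append(f1_score([labels[i] for i in held_out], predictions))
    return mean(held_out_f1)


# Hypothetical similarity scores for labeled pairs (1 = same inventor).
scores = [0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9, 0.95]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
print(cross_validate(scores, labels, candidates=[0.25, 0.5, 0.75], k=2))
```

In the actual experiments the threshold fit would be replaced by training the Siamese model on each training split, and the candidate grid by an Optuna study, which is what makes GPU resources necessary.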
Successful completion will yield a high-quality, deduplicated dataset of male inventors and a methodology for enhancing historical data quality using VLMs. This contributes both to economic history and to the application of AI/ML in the digital humanities; the methodology will form part of my thesis and of a published paper. Alvis's GPU resources are essential for the fulfilment of this project.