SUPR
The Marco Polo Embedder: Using Synthetic Data for Sentence Similarity Search in Textual Records of Four Ancient Civilizations
Dnr: NAISS 2024/22-1168
Type: NAISS Small Compute
Principal Investigator: Albin Thörn Cleland
Affiliation: Lunds universitet
Start Date: 2024-09-09
End Date: 2025-09-01
Primary Classification: 60202: Specific Languages


Abstract

Comparative research on various aspects of ancient civilizations can benefit significantly from meaning-based information retrieval. This requires a robust multilingual sentence transformer model capable of embedding text in English, Latin, Ancient Greek, Sanskrit, and Classical Chinese into a shared vector space. Such a model would enable simultaneous cosine-similarity search across all five languages and could serve as the backend for a Retrieval-Augmented Generation (RAG) system. Riemenschneider and Frank (2023, arXiv:2308.12008) sought to compensate for the relative scarcity of training data in ancient languages through knowledge distillation, but the results were not satisfactory. Our Marco Polo model will instead employ contrastive learning directly on a set of parallel sentence quintuples. These quintuples will be generated artificially with a large causal language model (LLaMA 3.1 8B) fine-tuned for translation between these languages. The resulting model will be valuable for scholars comparing various phenomena or uncovering transtextual links in the textual records of these ancient civilizations.

Step 1: Fine-Tune the Translation Model
First, we will fine-tune LLaMA 3.1 8B on openly available datasets containing aligned sentences between English and the four ancient languages: Latin, Ancient Greek, Sanskrit, and Classical Chinese. The goal is a translation model capable of translating effectively between English and each of these ancient languages (a fine-tuning sketch follows this abstract).

Step 2: Generate Synthetic Training Data
Next, we will use the fine-tuned model to generate synthetic training data by creating parallel sentences across all five languages (see the generation sketch below).

Step 3: Fine-Tune a Larger Text Embedding Model
With the synthetic training data prepared, we will fine-tune a multilingual text embedding model, such as BAAI/bge-m3, using contrastive learning (see the contrastive-training sketch below). This process will produce embeddings that represent sentences in all five languages within a unified vector space. In addition, we will fine-tune a reranker model, such as BAAI/bge-reranker-v2-m3, on the same dataset.

Step 4: Evaluate and Test the Models
Finally, we will evaluate the performance of the fine-tuned models on a sample set of Latin translations of the Analects of Confucius, the Rigveda, and Pseudo-Dionysius. We will then deploy the models within a Retrieval-Augmented Generation (RAG) system (see the retrieval sketch below).

We will release three models (MP-trans-v1, MP-embed-v1, and MP-rerank-v1) on Hugging Face, accompanied by an article detailing the research process. The immediate goal is to support two projects in comparative philosophy: one investigating cultural similarities and differences in the psychology and norms surrounding weeping, and one analyzing the spectra of aesthetic evaluative concepts in these ancient cultures. However, the models have far-reaching applications for researchers in philology, history, literature, and linguistics, enabling semantic searches across texts in multiple languages to reveal parallels, contrasts, and influences among these ancient traditions. The project sets a new baseline for this type of research and introduces a novel approach using synthetic data to address the limitations of existing datasets in ancient languages, thereby contributing to the broader fields of natural language processing and machine learning for low-resource languages.
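Sketch for Step 1 (illustrative only, not the final training script): a minimal LoRA fine-tune of a causal LM on English-Latin sentence pairs with the Hugging Face transformers, peft, and datasets libraries. The prompt template, hyperparameters, and the single toy sentence pair are assumptions for illustration; the actual run would use the full open parallel corpora and all four ancient languages.

from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "meta-llama/Llama-3.1-8B"  # gated checkpoint; access must be requested
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# Attach LoRA adapters so only a small fraction of the parameters is trained.
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32,
                                         target_modules=["q_proj", "v_proj"],
                                         task_type="CAUSAL_LM"))

# Toy aligned pair; the real data would come from open English-ancient-language corpora.
pairs = [{"en": "All men by nature desire to know.",
          "la": "Omnes homines natura scire desiderant."}]

def build_example(ex):
    # Simple instruction-style prompt assumed for the translation task.
    text = f"Translate English to Latin.\nEnglish: {ex['en']}\nLatin: {ex['la']}{tokenizer.eos_token}"
    return tokenizer(text, truncation=True, max_length=512)

train_ds = Dataset.from_list(pairs).map(build_example, remove_columns=["en", "la"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mp-trans-v1", per_device_train_batch_size=1,
                           num_train_epochs=1, learning_rate=2e-4, logging_steps=1),
    train_dataset=train_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

# Merge the adapters back into the base weights and save a standalone checkpoint.
model.merge_and_unload().save_pretrained("mp-trans-v1")
tokenizer.save_pretrained("mp-trans-v1")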
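Sketch for Step 2 (illustrative): prompting the fine-tuned translator to turn English seed sentences into parallel quintuples. The checkpoint path, prompt format, and seed sentence are assumptions; in practice the seed sentences and quality filtering would be chosen to match the target domains.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "mp-trans-v1"  # assumed path of the fine-tuned translation model
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto")

TARGETS = ["Latin", "Ancient Greek", "Sanskrit", "Classical Chinese"]

def translate(sentence, target):
    prompt = f"Translate English to {target}.\nEnglish: {sentence}\n{target}:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    # Decode only the generated continuation, not the prompt.
    return tokenizer.decode(output[0, inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True).strip()

seeds = ["The wise weep for the sufferings of others."]
quintuples = [{"en": s, **{t: translate(s, t) for t in TARGETS}} for s in seeds]
print(quintuples[0])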
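Sketch for Step 3 (illustrative): contrastive fine-tuning of the embedder with the sentence-transformers library, using in-batch negatives (MultipleNegativesRankingLoss). Each synthetic quintuple is expanded into cross-lingual positive pairs. The quintuple below contains placeholder strings and the hyperparameters are not the final configuration; the reranker (bge-reranker-v2-m3) would be fine-tuned separately on pairs derived from the same data.

from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

model = SentenceTransformer("BAAI/bge-m3")

# One synthetic quintuple (placeholder text); every cross-lingual combination
# of its sentences becomes a positive pair for contrastive training.
quintuple = {
    "en": "English sentence",
    "la": "Latin sentence",
    "grc": "Ancient Greek sentence",
    "sa": "Sanskrit sentence",
    "lzh": "Classical Chinese sentence",
}
texts = list(quintuple.values())
train_examples = [InputExample(texts=[a, b])
                  for i, a in enumerate(texts) for b in texts[i + 1:]]

loader = DataLoader(train_examples, shuffle=True, batch_size=16)
loss = losses.MultipleNegativesRankingLoss(model)  # other in-batch sentences act as negatives
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
model.save("mp-embed-v1")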
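Sketch for Step 4 (illustrative): cross-lingual cosine-similarity retrieval with the fine-tuned embedder, followed by reranking with the cross-encoder, as it would run inside the RAG backend. The names mp-embed-v1 and mp-rerank-v1 refer to the planned releases; the two-sentence corpus and the query are purely illustrative.

from sentence_transformers import CrossEncoder, SentenceTransformer, util

embedder = SentenceTransformer("mp-embed-v1")
reranker = CrossEncoder("mp-rerank-v1")

# Tiny multilingual corpus; the real corpora would be the textual records of the
# four ancient civilizations, embedded and indexed offline.
corpus = [
    "子曰：學而時習之，不亦說乎？",               # Analects 1.1 (Classical Chinese)
    "Omnia vincit Amor: et nos cedamus Amori.",   # Vergil, Eclogues 10.69 (Latin)
]
query = "Is it not a pleasure to learn and to practise what one has learned?"

corpus_emb = embedder.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)
query_emb = embedder.encode(query, convert_to_tensor=True, normalize_embeddings=True)

# Cosine-similarity search in the shared vector space.
hits = util.semantic_search(query_emb, corpus_emb, top_k=2)[0]
candidates = [corpus[hit["corpus_id"]] for hit in hits]

# Rerank the retrieved candidates with the cross-encoder.
scores = reranker.predict([(query, c) for c in candidates])
best = max(zip(scores, candidates))[1]
print(best)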