Automation of Data Collection Processes for Uppsala Conflict Data Program
Dnr:

NAISS 2025/22-1372

Type:

NAISS Small Compute

Principal Investigator:

Mert Can Yilmaz

Affiliation:

Uppsala universitet

Start Date:

2025-10-13

End Date:

2026-11-01

Primary Classification:

50604: Peace and Conflict Studies

Abstract

As part of efforts to automate data collection within the Uppsala Conflict Data Program (UCDP), we want to focus on text-corpus management. UCDP bases its work on a large but highly heterogeneous text corpus of about one million entries and approximately 100 million tokens, collected over approximately 35 years. We want to improve our management of the corpus in two ways:

- Structured data cleanup, i.e. extraction and streamlining of metadata such as article date, source, and collection type, which is currently stored in a highly unstructured, textual form. We have developed scripts to do this, but they need substantially more compute time than is otherwise available to us.
- Information extraction from the corpus of content that has previously been mined manually, such as information on peace agreements and cease-fires. We intend to fine-tune a medium-size, open-weight, instruction-tuned large language model such as Mistral for classification purposes.

We have piloted some approaches in conjunction with the partner project Peace Science Infrastructure in Oslo, but a lack of compute time has prevented us in Uppsala from deploying them at scale. Deploying them at scale would help us maintain our competitive edge.