As part of our work to automate parts of data collection within the Uppsala Conflict Data Program (UCDP), we want to focus on the management of our text corpus. UCDP bases its work on a large but highly heterogeneous text corpus of about one million entries and approximately 100 million tokens, collected over approximately 35 years. We want to improve our management of the corpus in two ways:
- structured data cleanup, i.e. extraction and streamlining of metadata such as article date, source, collection type etc., which is currently stored in a highly unstructured, textual form. We have developed scripts to do this, but they need substantially more compute time than is otherwise available to us.
- information extraction from the corpus of material that has so far been mined manually, such as information on peace agreements, cease-fires etc. We intend to fine-tune a medium-sized, open-weight, instruction-tuned large language model such as Mistral for classification purposes. We have piloted some approaches in collaboration with the partner project Peace Science Infrastructure in Oslo, but lack of compute time has prevented us in Uppsala from deploying them at scale, and thus from maintaining our competitive edge.
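
As a minimal illustration of the kind of metadata cleanup described in the first item above, the following Python sketch parses a free-text header line into a structured source and date. The header layout (`<source>, <day> <month name> <year>`), the `ArticleMeta` type and the `parse_header` function are assumptions made for illustration only; they do not reflect the existing UCDP scripts or the actual layout of the corpus.

```python
import re
from datetime import date, datetime
from typing import NamedTuple, Optional


class ArticleMeta(NamedTuple):
    """Structured metadata recovered from a free-text header (hypothetical schema)."""
    source: Optional[str]
    published: Optional[date]


# Assumed header layout: "<source>, <day> <English month name> <year>".
# Real corpus headers are likely far more varied than this single pattern.
HEADER_RE = re.compile(
    r"^\s*(?P<source>[^,]+?)\s*,\s*(?P<date>\d{1,2} [A-Za-z]+ \d{4})\b"
)


def parse_header(line: str) -> ArticleMeta:
    """Return source and date if the line matches the assumed layout, else empty fields."""
    match = HEADER_RE.match(line)
    if match is None:
        return ArticleMeta(None, None)
    try:
        # %B expects a full month name in the current locale (English by default).
        published = datetime.strptime(match.group("date"), "%d %B %Y").date()
    except ValueError:
        # The text looked like a date but was not a valid one; keep the source only.
        published = None
    return ArticleMeta(match.group("source"), published)


if __name__ == "__main__":
    print(parse_header("Reuters, 12 March 2004"))
    print(parse_header("text without a recognisable header"))
```

Entries that do not match the assumed pattern come back with empty fields rather than raising an error, so that they can be routed to a fallback rule or to manual review.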