Multilingual and lesser-resourced natural language processing
Dnr: NAISS 2024/22-745

Type: NAISS Small Compute

Principal Investigator: Marcel Bollmann

Affiliation: Linköpings universitet

Start Date: 2024-05-23

End Date: 2025-05-01

Primary Classification: 10208: Language Technology (Computational Linguistics)

Abstract

Improving natural language processing (NLP) models, such as large language models, for lesser-resourced languages is a timely and challenging problem. Recent multilingual models, such as mT5, BLOOM, or m-LLaMA, still lag behind in performance on many languages with fewer available resources than English. Training new models in low-resource scenarios is challenging, as many state-of-the-art advances in deep learning for NLP assume large quantities of training data and compute. This project will research techniques for improving NLP in these scenarios, with an expected focus on adaptation and modularization techniques (such as LoRA and other parameter-efficient fine-tuning (PEFT) methods) as well as tokenization and representation learning. The project will build on the PI's previous work on morphologically informed representations and historical NLP (https://marcel.bollmann.me/projects/) as well as collaborations with other members of the LiU NLP Group with expertise in transfer learning and representation learning. It will also contribute to ongoing and planned international research collaborations, e.g. with the NLP group at Aalborg University, Denmark. Potential impacts of this project include improved NLP tools for speakers of currently under-resourced languages, improved accessibility of historical documents for both researchers and laypeople, and advances in the state of the art for training generic representation learning models in highly multilingual scenarios. The project may also lead to new efficient methods that bring the performance of large language models for English to languages with fewer resources.
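
To illustrate the kind of adaptation technique the abstract refers to, below is a minimal sketch of LoRA-based parameter-efficient fine-tuning with the Hugging Face peft library. The base model (mT5) is one of those named above, but the library choice, hyperparameters, and target modules are illustrative assumptions, not the project's actual setup; the point is that only small low-rank adapter matrices are trained while the pretrained model stays frozen, which reduces the data and compute needed to adapt a model to a lesser-resourced language.

# Hypothetical sketch: LoRA adaptation of a multilingual model using the
# Hugging Face `peft` library. All hyperparameter values below are assumed
# for illustration and are not taken from the project description.
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, TaskType, get_peft_model

# Load a pretrained multilingual model (mT5 is one of the models named above).
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")

# Configure LoRA: small low-rank update matrices are injected into the
# attention projections; only these matrices are trained, the base weights
# remain frozen.
config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=8,                        # rank of the low-rank update (assumed value)
    lora_alpha=16,              # scaling factor (assumed value)
    lora_dropout=0.1,
    target_modules=["q", "v"],  # T5-style attention projection module names
)

peft_model = get_peft_model(model, config)
peft_model.print_trainable_parameters()  # typically well under 1% of all weights

The resulting peft_model can then be fine-tuned on target-language data with a standard training loop, while checkpoints only need to store the small adapter weights.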