SUPR
Multilingual and low-resource natural language processing
Dnr: NAISS 2023/6-76

Type: NAISS Medium Storage

Principal Investigator: Marcel Bollmann

Affiliation: Linköpings universitet

Start Date: 2023-03-30

End Date: 2024-04-01

Primary Classification: 10208: Language Technology (Computational Linguistics)

Secondary Classification: 10201: Computer Sciences

Abstract

Natural language processing (NLP) has seen significant advances through the use of deep learning, but transferring those successes to multilingual or low-resource scenarios remains a challenging problem. Recent multilingual models, such as mBERT, mT5, or BLOOM, still lag behind in performance for many languages that do not have as many available resources as English. Training new models in low-resource scenarios is difficult, as many state-of-the-art advances in deep learning for NLP assume large quantities of training data and compute. This project will research techniques for improving NLP in those scenarios, e.g., multilingual representation learning, machine translation involving under-resourced languages (e.g., Creoles), NLP for historical documents, or methods for cross-lingual adaptation, such as exploring how models can be adapted to Swedish with lower computational and data resources. The project will build on previous work on morphologically informed representations and historical NLP by the PI (https://marcel.bollmann.me/projects/) as well as on collaboration with other members of the LiU NLP Group with expertise in transfer learning and representation learning. Potential impacts of this project include the improvement of NLP tools for speakers of currently under-resourced languages, improved accessibility of historical documents for both researchers and laypeople, and advances in the state of the art for training generic representation learning models in highly multilingual scenarios. The project can also lead to new efficient methods that bring the performance of large language models for English to languages with fewer resources.