Cognitive Foundations of Language Modeling
Dnr: NAISS 2024/5-483

Type: NAISS Medium Compute

Principal Investigator: Sharid Loáiciga

Affiliation: Göteborgs universitet

Start Date: 2024-09-26

End Date: 2025-04-01

Primary Classification: 10208: Language Technology (Computational Linguistics)

Secondary Classification: 10207: Computer Vision and Robotics (Autonomous Systems)

Tertiary Classification: 10204: Human Computer Interaction (Social aspects to be 50803)


Abstract

The goal of this project is to train cognitively inspired language models that more closely mimic human learning processes. Current AI is almost universally powered by large language models (LLMs). LLMs, however, are fed ever-growing amounts of data in order to increase their generalization abilities, which in turn increases the size and energy needs of the models themselves. Our project uses insights about human linguistic behavior to curb the indiscriminate data use of present-day approaches.

Our approach follows two primary directions. First, we will train language models using Active Learning techniques, which make optimal use of moderate amounts of data: the model focuses on the most informative examples, improving performance with fewer training samples. We aim for a model trained on about 100 million tokens, the estimated amount of linguistic input a 12-year-old has been exposed to. Second, we will train multimodal models that incorporate not only text but also visual and speech inputs, aligning more closely with the way humans process information across different senses. How to combine these modalities into robust and coherent language representations is an empirical question.

In both directions, the project will involve extensive architecture search, hyper-parameter optimization, and other standard components of model development from scratch. In addition, we will experiment with diverse pre-trained models, fine-tuning them to adapt them to our specific needs. These models will be extensively evaluated and fine-tuned on downstream tasks for comparison with the models built from the ground up.

Our models, code and results will be published in accordance with the VR guidelines, i.e., made openly available under permissive licenses. We do not use personal data and rely on publicly available datasets widely used in the computational linguistics community. We will acknowledge the public infrastructure used for computation in our scientific publications.
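To illustrate the first direction, the following is a minimal sketch of pool-based Active Learning for a toy causal language model, using per-example loss as an uncertainty score to select the most informative examples in each round. The model architecture, pool sizes, and selection heuristic are illustrative assumptions only, not the project's actual setup.

# Minimal sketch: pool-based active learning for a tiny causal LM,
# with uncertainty sampling by per-example loss. All names and sizes
# are illustrative, not the project's actual configuration.
import torch
import torch.nn as nn

VOCAB, DIM, SEQ_LEN = 1000, 64, 32

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.rnn = nn.GRU(DIM, DIM, batch_first=True)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, x):                      # x: (batch, seq)
        h, _ = self.rnn(self.embed(x))
        return self.head(h)                    # (batch, seq, vocab)

def example_loss(model, batch):
    # Per-example next-token loss, used here as an uncertainty score.
    logits = model(batch[:, :-1])
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, VOCAB), batch[:, 1:].reshape(-1), reduction="none"
    )
    return loss.view(batch.size(0), -1).mean(dim=1)  # (batch,)

model = TinyLM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
pool = torch.randint(0, VOCAB, (5000, SEQ_LEN))       # unlabeled candidate pool
selected = torch.empty(0, SEQ_LEN, dtype=torch.long)  # growing training set

for round_ in range(10):
    # 1) Score the remaining pool; higher loss = more informative under this heuristic.
    with torch.no_grad():
        scores = example_loss(model, pool)
    top = scores.topk(128).indices
    selected = torch.cat([selected, pool[top]])
    keep = torch.ones(pool.size(0), dtype=torch.bool)
    keep[top] = False
    pool = pool[keep]

    # 2) Train on everything selected so far.
    for batch in selected.split(64):
        opt.zero_grad()
        example_loss(model, batch).mean().backward()
        opt.step()
    print(f"round {round_}: {selected.size(0)} examples selected")

In practice the same loop would wrap a full transformer language model and a curated pool on the order of 100 million tokens, and the loss-based scoring function could be swapped for other acquisition criteria (e.g. predictive entropy or disagreement between model checkpoints); which criterion works best is part of what the project will investigate empirically.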