SUPR
Cognitive Foundations of Language Modeling
Dnr:

NAISS 2024/6-297

Type:

NAISS Medium Storage

Principal Investigator:

Sharid Loáiciga

Affiliation:

Göteborgs universitet

Start Date:

2024-09-27

End Date:

2025-04-01

Primary Classification:

10208: Language Technology (Computational Linguistics)

Secondary Classification:

10207: Computer Vision and Robotics (Autonomous Systems)

Tertiary Classification:

10204: Human Computer Interaction (Social aspects to be 50803)

Allocation

Abstract

The goal of the project is to train cognitively inspired language models that more closely mimic human learning processes. Current AI is almost universally powered by large language models (LLMs). LLMs, however, are fed ever-growing amounts of data in order to increase their generalization abilities, which in turn increases the size and energy needs of the models themselves. Our project uses insights about human linguistic behavior to limit the indiscriminateness of present-day approaches.

Our approach includes two primary directions. First, we will train language models using Active Learning techniques, which optimize the use of moderate amounts of data. This method allows the model to focus on the most informative examples, improving performance with fewer training samples. We aim for a model trained on about 100 million tokens (the estimated amount of linguistic data a 12-year-old has been exposed to). Second, we will train multimodal models that incorporate not only text but also visual and speech inputs, aligning more closely with the way humans process information across different senses. How to combine these modalities into robust and coherent language representations is an empirical question.

In both directions, the project will involve extensive architecture search, hyper-parameter optimization, and other standard components of model development from scratch. In addition, we will experiment with diverse pre-trained models, which will require fine-tuning to adapt them to our specific needs. Our models, code, and results will be published in accordance with the VR guidelines, i.e., openly available under permissive licenses. We do not use personal data and rely on publicly available datasets widely used in the computational linguistics community. For training data, we will mainly rely on Visual Writing Prompts (Hong 2023; 200 GB), BabyLM (20 GB), and visual features from MSCOCO, Visual Genome, and VisDial (700 GB).

In this project, we want to carry out a more comprehensive hyperparameter search than we currently can with our in-house resources. This means that many checkpoints will be stored for comparison and for investigating model robustness. For example, we want to test our learning algorithms across large batch sizes over multiple GPUs, model size, number of layers, types of input embeddings, network connectivity, learning rate, and several more parameters related to our proposed novel learning method. All in all, we estimate 10 to 15 additional parameters to be searched, and combinations thereof. All of these will require not only computation but also saved checkpoints so that model characteristics can be analyzed over training time.

In terms of checkpoint storage specifically, we estimate that each model checkpoint will require about 4 GB. Over training, we need to save on average 50 checkpoints per model, and we need 5 runs per configuration, summing to 1,000 GB per hyperparameter combination. With at least 15 combinations, this adds up to approximately 15 TB, plus a 5 TB safety margin in case some models are bigger than expected. We will acknowledge the public infrastructure used for computation in our scientific publications.
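
To illustrate the Active Learning direction described above, a minimal pool-based acquisition loop is sketched below. The uncertainty function, pool contents, and budget figures are placeholders for illustration only, not the project's actual method or data.

```python
# Toy pool-based active learning loop (illustrative only; the project's actual
# model, acquisition function, and data are not specified here).
import random

random.seed(0)

# Hypothetical unlabelled pool of text fragments (placeholder strings).
pool = [f"sentence_{i}" for i in range(10_000)]
train_set = []

def uncertainty(example):
    """Stand-in for a model-based informativeness score (e.g. predictive entropy).
    A partially trained LM would be queried here; we return a random score."""
    return random.random()

BUDGET = 1_000   # total number of examples the model is allowed to train on
BATCH = 100      # examples acquired per round

while len(train_set) < BUDGET and pool:
    # Rank the remaining pool by estimated informativeness and acquire the top batch.
    pool.sort(key=uncertainty, reverse=True)
    acquired, pool = pool[:BATCH], pool[BATCH:]
    train_set.extend(acquired)
    # retrain_model(train_set)  # in the real setup, the LM would be updated each round

print(f"Selected {len(train_set)} of 10,000 pool examples under the data budget")
```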
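
As a sanity check on the storage request above, the arithmetic behind the 15 TB estimate can be reproduced directly from the figures stated in the abstract:

```python
# Back-of-the-envelope checkpoint storage estimate, using the figures from the abstract.
gb_per_checkpoint = 4          # estimated size of one saved model checkpoint
checkpoints_per_run = 50       # checkpoints saved over one training run
runs_per_setting = 5           # repeated runs per hyperparameter combination
settings = 15                  # hyperparameter combinations (lower bound)
safety_margin_gb = 5_000       # buffer in case some models are larger than expected

per_setting_gb = gb_per_checkpoint * checkpoints_per_run * runs_per_setting  # 1,000 GB
total_gb = per_setting_gb * settings + safety_margin_gb                      # 20,000 GB

print(f"{per_setting_gb} GB per hyperparameter combination")
print(f"{total_gb / 1000:.0f} TB requested in total (15 TB estimate + 5 TB margin)")
```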