The goal of this project is to train cognitively inspired language models that better reflect human learning processes. Unlike current large language models (LLMs), which rely on ever-expanding datasets and energy-hungry architectures, our approach leverages insights from human linguistic behavior to develop more efficient training strategies.
We pursue two main directions: (1) pre-training with constrained data budgets (e.g., ~100M tokens, the estimated linguistic exposure of a 13-year-old), and (2) studying the linguistic abilities of multimodal models on tasks such as coherent text generation and grammaticality judgments. Both directions require extensive architecture search, hyperparameter optimization, and fine-tuning of pre-trained models. All models, code, and results will be released openly under permissive licenses, in line with VR guidelines.
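To make the grammaticality-judgment evaluation concrete, the sketch below scores a minimal pair by comparing sentence log-probabilities under a causal language model, the protocol used by benchmarks such as BLiMP. It assumes the HuggingFace transformers library; the gpt2 checkpoint and the example sentence pair are illustrative placeholders, not project artifacts.

```python
# Minimal-pair grammaticality scoring: a model "judges" a pair correctly
# if it assigns higher log-probability to the grammatical variant
# (the evaluation style used by benchmarks such as BLiMP).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; any causal LM checkpoint works

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

def sentence_logprob(sentence: str) -> float:
    """Sum of token log-probabilities of the sentence under the model."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Shift so each position predicts the next token.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = ids[:, 1:]
    return log_probs.gather(2, targets.unsqueeze(-1)).sum().item()

# Hypothetical minimal pair differing only in subject-verb agreement.
good = "The keys to the cabinet are on the table."
bad = "The keys to the cabinet is on the table."
print("correct" if sentence_logprob(good) > sentence_logprob(bad) else "incorrect")
```

Looping this comparison over a full benchmark and averaging per-pair accuracy gives the aggregate judgment score; the same scoring function applies unchanged to our own pre-trained checkpoints.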
Our datasets include Visual Writing Prompts (200 GB), BabyLM (20 GB), and multimodal resources such as MSCOCO, Visual Genome, and VisDial (~700 GB combined). We use only publicly available, non-personal data.
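The constrained-budget setting can likewise be sketched in a few lines: stream a public corpus and stop collecting once the token budget is reached. This is a minimal illustration only; the dataset identifier and tokenizer below are placeholders standing in for the corpora listed above, and load_dataset with streaming comes from the HuggingFace datasets library.

```python
# Enforcing a fixed token budget while streaming a public text corpus,
# so the pre-training set never exceeds the target exposure (~100M tokens).
from datasets import load_dataset
from transformers import AutoTokenizer

TOKEN_BUDGET = 100_000_000   # ~ estimated linguistic exposure of a 13-year-old
DATASET_NAME = "wikitext"    # placeholder; any public text corpus works
CONFIG = "wikitext-103-raw-v1"

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder tokenizer
stream = load_dataset(DATASET_NAME, CONFIG, split="train", streaming=True)

collected, used = [], 0
for example in stream:
    ids = tokenizer(example["text"]).input_ids
    if used + len(ids) > TOKEN_BUDGET:
        break  # stop before the budget is exceeded
    collected.append(ids)
    used += len(ids)

print(f"kept {len(collected)} documents, {used} tokens")
```

Streaming keeps memory use flat regardless of corpus size, which matters when sampling budget-sized subsets from the larger multimodal resources.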
In our scientific publications, we will acknowledge the public infrastructure used for computation.