NAISS
SUPR
Cognitive Foundations of Language Modeling
Dnr: NAISS 2025/6-359

Type: NAISS Medium Storage

Principal Investigator: Sharid Loáiciga

Affiliation: Göteborgs universitet

Start Date: 2025-10-01

End Date: 2026-10-01

Primary Classification: 10208: Natural Language Processing

Secondary Classification: 10210: Artificial Intelligence

Tertiary Classification: 20208: Computer Vision and Learning Systems (Computer Sciences aspects in 10207)

Allocation

Abstract

The goal of this project is to train cognitively inspired language models that better reflect human learning processes. Unlike current large language models (LLMs), which rely on ever-expanding datasets and energy-hungry architectures, our approach leverages insights from human linguistic behavior to develop more efficient training strategies. We pursue two main directions: (1) pre-training with constrained data budgets (e.g., ~100M tokens, the estimated linguistic exposure of a 13-year-old), and (2) studying the linguistic abilities of multimodal models on diverse tasks such as coherent text generation and grammaticality judgments. This requires extensive architecture search, hyperparameter optimization, and fine-tuning of pre-trained models. All models, code, and results will be released openly under permissive licenses, in line with VR guidelines. Our datasets include Visual Writing Prompts (200 GB), BabyLM (20 GB), and multimodal resources such as MSCOCO, Visual Genome, and VisDial (~700 GB). We rely only on publicly available, non-personal data. We will acknowledge the public infrastructure used for computation in our scientific publications.