The goal of this project is to train cognitively inspired language models that better reflect human learning processes. Unlike current large language models (LLMs), which rely on ever-expanding datasets and energy-hungry architectures, our approach leverages insights from human linguistic behavior to develop more efficient training strategies.
We pursue two main directions: (1) pre-training with constrained data budgets (e.g., ~100M tokens, the estimated linguistic exposure of a 13-year-old), and (2) studying the linguistic abilities of multimodal models on tasks such as coherent text generation and grammaticality judgments. Both directions require extensive architecture search, hyperparameter optimization, and fine-tuning of pre-trained models. All models, code, and results will be released openly under permissive licenses, in line with VR guidelines.
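To make the grammaticality-judgment evaluation concrete, the sketch below scores a minimal pair by comparing sentence log-probabilities under a causal language model, the protocol used by benchmarks such as BLiMP. It assumes the HuggingFace transformers library; the gpt2 checkpoint and the example sentence pair are illustrative placeholders, not project artifacts.

```python
# Minimal-pair grammaticality scoring: a model "judges" a pair correctly
# if it assigns higher log-probability to the grammatical variant
# (the evaluation style used by benchmarks such as BLiMP).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; any causal LM checkpoint works

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

def sentence_logprob(sentence: str) -> float:
    """Sum of token log-probabilities of the sentence under the model."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Shift so each position predicts the next token.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = ids[:, 1:]
    return log_probs.gather(2, targets.unsqueeze(-1)).sum().item()

# Hypothetical minimal pair differing only in subject-verb agreement.
good = "The keys to the cabinet are on the table."
bad = "The keys to the cabinet is on the table."
print("correct" if sentence_logprob(good) > sentence_logprob(bad) else "incorrect")
```

Looping this comparison over a full benchmark and averaging per-pair accuracy gives the aggregate judgment score; the same scoring function applies unchanged to our own pre-trained checkpoints.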
Our datasets include Visual Writing Prompts (200 GB), BabyLM (20 GB), and multimodal resources such as MSCOCO, Visual Genome, and VisDial (~700 GB combined). We use only publicly available, non-personal data.
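The constrained-budget setting can likewise be sketched in a few lines: stream a public corpus and stop collecting once the token budget is reached. This is a minimal illustration only; the dataset identifier and tokenizer below are placeholders standing in for the corpora listed above, and load_dataset with streaming comes from the HuggingFace datasets library.

```python
# Enforcing a fixed token budget while streaming a public text corpus,
# so the pre-training set never exceeds the target exposure (~100M tokens).
from datasets import load_dataset
from transformers import AutoTokenizer

TOKEN_BUDGET = 100_000_000   # ~ estimated linguistic exposure of a 13-year-old
DATASET_NAME = "wikitext"    # placeholder; any public text corpus works
CONFIG = "wikitext-103-raw-v1"

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder tokenizer
stream = load_dataset(DATASET_NAME, CONFIG, split="train", streaming=True)

collected, used = [], 0
for example in stream:
    ids = tokenizer(example["text"]).input_ids
    if used + len(ids) > TOKEN_BUDGET:
        break  # stop before the budget is exceeded
    collected.append(ids)
    used += len(ids)

print(f"kept {len(collected)} documents, {used} tokens")
```

Streaming keeps memory use flat regardless of corpus size, which matters when sampling budget-sized subsets from the larger multimodal resources.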
In our scientific publications, we will acknowledge the public infrastructure used for computation.