Modularising large language models
Dnr: NAISS 2025/5-725
Type: NAISS Medium Compute
Principal Investigator: Marcel Bollmann
Affiliation: Linköpings universitet
Start Date: 2026-01-01
End Date: 2027-01-01
Primary Classification: 10208: Natural Language Processing
Secondary Classification: 10201: Computer Sciences
Webpage:


Abstract

Our project investigates modularisation strategies for decoder-based large language models (LLMs) to better support different languages, language groups, and domains. We focus on three approaches: modular pre-training, continued pre-training for cross-lingual knowledge transfer, and post-training reinforcement learning to improve alignment and factual accuracy in low-resource languages.

Modular pre-training addresses the limitations of monolithic LLMs, which often struggle to represent diverse languages and domains efficiently. By introducing specialised modules during pre-training, the model can grow dynamically, adding new capabilities without retraining the entire model. Modular architectures also improve interpretability, allowing us to study how different language or domain components interact. We will pre-train modular models starting from a base configuration and add modules incrementally for additional languages or domains. We will explore approaches such as Sparse Mixture of Experts (SMoE) with soft expert routing for language-specific components, and Tokenformer-style architectures with sparsified token-parameter attention. This setup is flexible, allowing modules to be added or removed as needed, and enables comparisons with non-modular baselines to better understand how knowledge is shared across components.

Continued pre-training targets the challenge of low-resource languages, whose factual knowledge is often poorly represented. We will continue training on multilingual corpora with strategies that enhance cross-lingual transfer: using parallel data to improve alignment, applying transliteration for languages written in non-Latin scripts, leveraging tokenisation-free approaches, partially bypassing language-specific layers, and integrating modules or adapters that specialise in low-resource languages while reusing knowledge from high-resource languages. This helps low-resource languages inherit factual knowledge efficiently while maintaining overall language modelling performance.

Post-training reinforcement learning provides an additional layer of improvement. The model will be fine-tuned with feedback signals that reward correct factual reasoning in low-resource languages. Reward modelling will rely on high-resource-language knowledge, such as high-quality reference answers or retrieved information. Reinforcement learning fine-tuning will optimise the model to reproduce facts accurately in low-resource languages while maintaining fluency and contextual coherence. This stage complements the modular and continued pre-training strategies, forming a complete pipeline that maximises cross-lingual knowledge transfer and model reliability.

Overall, our approach combines modular pre-training, targeted continued pre-training, and reinforcement learning to improve the multilingual and cross-domain capabilities of decoder-based LLMs. By focusing on scalability, interpretability, and efficient knowledge transfer, we aim to create models that perform well across high- and low-resource languages while enabling flexible updates and extensions over time.
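
To make the modular pre-training idea concrete, the following is a minimal PyTorch sketch of a soft-routed Mixture-of-Experts feed-forward layer that can grow by one expert at a time, for example when a new language is added. The class names (ExpertFFN, LanguageMoE) and the add_expert method are illustrative assumptions, not the project's actual architecture.

# Illustrative sketch only: a soft-routed Mixture-of-Experts feed-forward block.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExpertFFN(nn.Module):
    """A standard transformer feed-forward block used as one expert."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class LanguageMoE(nn.Module):
    """Soft routing: every token is a weighted mixture of all experts."""

    def __init__(self, d_model: int, d_hidden: int, n_experts: int):
        super().__init__()
        self.experts = nn.ModuleList(
            ExpertFFN(d_model, d_hidden) for _ in range(n_experts)
        )
        self.router = nn.Linear(d_model, n_experts)

    def add_expert(self, d_hidden: int) -> None:
        """Grow the model: append a new expert (e.g. for a new language)."""
        d_model = self.router.in_features
        self.experts.append(ExpertFFN(d_model, d_hidden))
        # Extend the router with a freshly initialised output row,
        # keeping the learned routing weights for existing experts.
        old = self.router
        self.router = nn.Linear(d_model, len(self.experts))
        with torch.no_grad():
            self.router.weight[: old.out_features] = old.weight
            self.router.bias[: old.out_features] = old.bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        weights = F.softmax(self.router(x), dim=-1)                     # (batch, seq, n_experts)
        expert_out = torch.stack([e(x) for e in self.experts], dim=-1)  # (batch, seq, d_model, n_experts)
        return torch.einsum("bsdn,bsn->bsd", expert_out, weights)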
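
The adapter-based continued pre-training strategy can be sketched in a similarly minimal way: a bottleneck adapter is attached to a frozen decoder block and only the adapter parameters are trained on low-resource-language data. BottleneckAdapter and freeze_base_train_adapters are hypothetical names, and the sketch assumes adapter parameters carry "adapter" in their names; it is not the project's actual implementation.

# Illustrative sketch only: a bottleneck adapter for continued pre-training.
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Small residual adapter: down-project, non-linearity, up-project."""

    def __init__(self, d_model: int, d_bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, d_bottleneck)
        self.up = nn.Linear(d_bottleneck, d_model)
        self.act = nn.GELU()

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the frozen base representation intact.
        return hidden + self.up(self.act(self.down(hidden)))


def freeze_base_train_adapters(model: nn.Module) -> None:
    """Freeze the base model; leave only adapter parameters trainable.

    Assumes adapter submodules contain "adapter" in their parameter names.
    """
    for name, param in model.named_parameters():
        param.requires_grad = "adapter" in name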
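
Finally, a toy version of the post-training reward signal: a generated answer in a low-resource language is scored against a high-quality reference answer, assumed to have been translated or retrieved from a high-resource language beforehand. Simple unigram-overlap F1 stands in for a learned reward model here; it is only meant to illustrate the shape of the reward, not the reward modelling planned in the project.

# Illustrative sketch only: a toy factuality reward for RL fine-tuning.
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between a generated answer and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def reward(generated_answer: str, reference_answer: str) -> float:
    """Reward used in the RL objective: higher when facts match the reference."""
    return token_f1(generated_answer, reference_answer)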