SUPR
Modularising large language models
Dnr:

NAISS 2024/5-698

Type:

NAISS Medium Compute

Principal Investigator:

Marcel Bollmann

Affiliation:

Linköpings universitet

Start Date:

2024-12-20

End Date:

2026-01-01

Primary Classification:

10208: Language Technology (Computational Linguistics)

Secondary Classification:

10201: Computer Sciences

Webpage:

Abstract

Our project investigates modularisation strategies for large language models (LLMs) targeting different languages, language groups, or domains. We explore two approaches:

1. Modular Pre-Training: Introducing modules during pre-training using architectures such as sparse mixture of experts (SMoE) and Tokenformer with sparsified "pattention".
2. Post-Hoc Parameter-Efficient Adaptation: Adding modules to pre-trained models using techniques such as LoRA and adapters.

### Part 1: Modular Pre-Training

Most LLMs use monolithic architectures that struggle to represent diverse languages and domains efficiently, especially at smaller model sizes. Modular pre-training offers a scalable solution by enabling the creation of specialised modules. This approach supports dynamic model updates, allowing new capabilities to be added without retraining the entire model. Moreover, modular models enhance interpretability by enabling fine-grained analysis of how different language or domain components interact. These advantages make modular pre-training a promising direction for LLM development.

We will pre-train modular models starting with a 125M-parameter base. Modules will be incrementally added to expand the supported languages or domains, following approaches such as (a) a custom Sparse Mixture of Experts (SMoE) variant that uses soft expert routing for language-specific components, and (b) the Tokenformer architecture with sparsified token-parameter attention on domain-specific or language-specific tokens (both mechanisms are sketched in the code examples below). This strategy ensures flexibility, allowing modules to be added or removed based on application needs. It also enables probing and comparison with non-modular baseline models to better understand interactions between different knowledge representations in LLMs.

### Part 2: Post-Hoc Parameter-Efficient Adaptation

Adapting existing pre-trained models is crucial for practical deployment. Post-hoc adaptation allows extending a model's capabilities to new languages and domains without retraining the entire model. This makes adaptation cost-effective, efficient, and scalable, enabling continuous updates as new data becomes available. Moreover, parameter-efficient fine-tuning methods such as LoRA and adapters reduce computational and storage overhead, making them particularly suitable for multilingual and domain-specific NLP tasks.

We will adapt existing models to new languages and domains using parameter-efficient fine-tuning (a minimal LoRA sketch is given after the list below). We plan to initially adapt the following models:

- LLaMA 3.2-1B and 3B (due to their popularity and multilingual capacity)
- SmolLM2-1.7B (open-license alternative)
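
To make the soft-routing idea from Part 1 concrete, the following PyTorch sketch shows a feed-forward block whose output is a softly weighted mixture of expert sub-networks. It is a minimal illustration under stated assumptions, not the project's actual SMoE variant; the expert layout, router, and dimensions are placeholders.

```python
import torch
import torch.nn as nn


class SoftRoutedMoE(nn.Module):
    """Toy mixture-of-experts feed-forward block with soft routing.

    Illustrative only: with soft routing every expert processes every token
    and the router produces a convex combination of expert outputs, unlike
    top-k sparse routing.
    """

    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # one routing logit per expert
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        weights = torch.softmax(self.router(x), dim=-1)                  # (batch, seq, n_experts)
        expert_out = torch.stack([e(x) for e in self.experts], dim=-2)   # (batch, seq, n_experts, d_model)
        return (weights.unsqueeze(-1) * expert_out).sum(dim=-2)          # (batch, seq, d_model)


if __name__ == "__main__":
    layer = SoftRoutedMoE(d_model=64, d_ff=256, n_experts=4)
    y = layer(torch.randn(2, 10, 64))
    print(y.shape)  # torch.Size([2, 10, 64])
```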
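The Tokenformer-style component can be pictured as attention between input tokens and a bank of learnable parameter tokens ("pattention"), where sparsification restricts each input to the parameter tokens of its language or domain. The sketch below is an assumption-based simplification: the original Tokenformer uses a modified normalisation rather than plain softmax, and the masking scheme here is purely illustrative.

```python
import torch
import torch.nn as nn


class SparsePattention(nn.Module):
    """Simplified token-parameter attention with a module mask.

    Inputs attend over learnable key/value parameter tokens; a boolean mask
    limits attention to the parameter tokens of the active language/domain
    module. Plain softmax is used here for brevity.
    """

    def __init__(self, d_model: int, n_param_tokens: int):
        super().__init__()
        self.key_params = nn.Parameter(torch.randn(n_param_tokens, d_model) * 0.02)
        self.value_params = nn.Parameter(torch.randn(n_param_tokens, d_model) * 0.02)

    def forward(self, x: torch.Tensor, active_mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); active_mask: (n_param_tokens,) boolean
        scores = x @ self.key_params.T                            # (batch, seq, n_param_tokens)
        scores = scores.masked_fill(~active_mask, float("-inf"))  # hide inactive parameter tokens
        attn = torch.softmax(scores, dim=-1)
        return attn @ self.value_params                           # (batch, seq, d_model)


if __name__ == "__main__":
    layer = SparsePattention(d_model=64, n_param_tokens=128)
    mask = torch.zeros(128, dtype=torch.bool)
    mask[:32] = True  # activate only the first block of parameter tokens (one "module")
    out = layer(torch.randn(2, 10, 64), mask)
    print(out.shape)  # torch.Size([2, 10, 64])
```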
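For Part 2, a typical starting point is LoRA fine-tuning via the Hugging Face PEFT library. The snippet below is a minimal sketch: the model identifiers are the Hub names we assume correspond to the checkpoints listed above, and the LoRA hyperparameters are illustrative defaults rather than the project's settings.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Assumed Hugging Face Hub identifiers for the models named above.
# The Llama checkpoints are gated and may require accepting the licence on the Hub.
model_name = "meta-llama/Llama-3.2-1B"  # or "HuggingFaceTB/SmolLM2-1.7B"

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Illustrative LoRA configuration: low-rank updates on the attention projections.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the low-rank adapter weights are trainable
```

The wrapped model can then be trained on language- or domain-specific data with any standard causal-LM training loop, with only the adapter weights stored per module.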