SUPR
Efficient and flexible language model inference with Matryoshka Representation Learning
Dnr:

NAISS 2025/22-1018

Type:

NAISS Small Compute

Principal Investigator:

Magnus Boman

Affiliation:

Karolinska Institutet

Start Date:

2025-07-31

End Date:

2025-12-01

Primary Classification:

10210: Artificial Intelligence

Webpage:

Allocation

Abstract

Transformer models are a versatile class of AI model that forms the basis of contemporary machine learning systems, including language models, computer vision, and speech transcription. Contemporary transformer models have huge parameter counts and require large amounts of computational resources. As a result, training transformer-based language models now causes both economic and environmental problems. Previously, we developed transformer models that use Matryoshka Representation Learning (MRL), a technique that optimises multiple sizes of the same model simultaneously. We implemented a method that applies MRL in all major blocks of the transformer. MRL produces multiple usable model sizes that can be selected at inference time. Our results showed that MRL improves the compute-performance trade-off of transformers by (i) improving efficiency exponentially as a function of the chosen model size while (ii) causing only minor increases in loss. Due to constraints on available compute, the model was itself small and was trained on a very small corpus of text. Thus, although promising, the current results are limited with regard to extrapolation to production-grade models. This project aims to extend our previous findings and to evaluate the performance of our proposed models more robustly in realistic scenarios. Furthermore, the ability to switch efficiently between smaller and larger model sizes according to user needs may aid on-device AI, supporting user privacy. We are also developing a method to infer the required model size from user queries, allowing the compute-performance trade-off to be made optimally. We plan to publish the results of this compute project as a scientific paper in an international forum.
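To illustrate the core idea behind MRL as described above, the following is a minimal sketch, not the project's actual implementation: a single hidden vector is trained so that every nested prefix of it is a usable representation on its own, by summing a loss term over each prefix size. All dimensions, names, and the classification head are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not the project's actual configuration)
FULL_DIM = 8             # full hidden size
NESTED_DIMS = [2, 4, 8]  # Matryoshka prefix sizes, selectable at inference
N_CLASSES = 3

def softmax(z):
    # Numerically stable softmax over a logit vector
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def mrl_loss(hidden, W, label):
    """Sum the cross-entropy over each nested prefix of the hidden vector.

    For a prefix of size d, only the first d entries of `hidden` and the
    first d rows of the shared head `W` are used, so each prefix is
    optimised to work as a standalone, smaller model.
    """
    total = 0.0
    for d in NESTED_DIMS:
        logits = hidden[:d] @ W[:d, :]   # use only the first d dimensions
        probs = softmax(logits)
        total += -np.log(probs[label])   # cross-entropy for this prefix
    return total

hidden = rng.standard_normal(FULL_DIM)
W = rng.standard_normal((FULL_DIM, N_CLASSES))
loss = mrl_loss(hidden, W, label=1)
print(loss)
```

At inference, one simply truncates the representation (and the corresponding weight rows) to the chosen prefix size, which is what allows a single trained model to be served at multiple sizes.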