Main supervisor: Per Stenstrom
Affiliation: Chalmers
Abstract:
The deployment of Large Language Models (LLMs) on client devices is fundamentally constrained by memory bandwidth, particularly during the autoregressive decode phase, where single-user workloads fail to saturate the compute capabilities of modern Neural Processing Units (NPUs). While lossy compression techniques like quantization are widely adopted, they necessitate accuracy trade-offs. Lossless compression, conversely, has seen limited adoption in inference pipelines because of the runtime overhead of decompression and the mismatch between common compression formats and hardware access patterns.
We propose CoCo (**Co**mpile-time **Co**mpression), a compiler-driven framework for lossless weight compression designed for spatial NPU architectures. CoCo leverages the abundance of compute tiles left idle during memory-bound operations to perform software-based decompression, effectively converting wasted cycles into additional memory bandwidth. The approach exploits the static nature of LLM weights and the predictability of decode-phase access patterns to perform format selection and cost modeling offline. By integrating with the MLIR-based IREE compiler, CoCo aligns compression formats with loop tiling and DMA granularities, ensuring that decompression does not stall the execution pipeline.
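The offline format-selection idea can be illustrated with a minimal cost-model sketch. All format names, bandwidth figures, and compression ratios below are illustrative assumptions, not measurements or parts of CoCo; the point is only that, when DMA transfer and on-tile decompression overlap, the slower of the two dominates the cost of streaming a weight tile, and the compiler can pick the cheapest format per tile ahead of time.

```python
# Hypothetical offline cost model for per-tile lossless format selection.
# All constants are illustrative placeholders, not real hardware numbers.

DRAM_BW = 50_000    # assumed DRAM bandwidth, bytes per microsecond (~50 GB/s)
DECOMP_BW = 20_000  # assumed software decompression rate on idle tiles (bytes/us)

# name: (compression ratio, relative decompression cost multiplier)
FORMATS = {
    "raw":     (1.00, 0.0),  # no decompression needed
    "bitpack": (0.75, 0.5),  # cheap fixed-width bit packing
    "lz_like": (0.60, 1.0),  # denser, but costlier to decode
}

def tile_cost_us(tile_bytes: int, ratio: float, decomp_mult: float) -> float:
    """Cost of streaming one tile when DMA transfer overlaps with
    decompression: the slower stage dominates."""
    compressed = tile_bytes * ratio
    transfer = compressed / DRAM_BW
    decompress = compressed * decomp_mult / DECOMP_BW
    return max(transfer, decompress)

def select_format(tile_bytes: int) -> str:
    """Pick the format with the lowest overlapped streaming cost."""
    return min(FORMATS, key=lambda f: tile_cost_us(tile_bytes, *FORMATS[f]))
```

Under these toy numbers, `select_format(65536)` chooses `"bitpack"`: its decompression cost is hidden behind the (shortened) transfer better than the denser but slower-to-decode format. In a real compiler pass, the constants would come from profiled hardware parameters and the decision would be made once per weight tensor at compile time.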