Main supervisor: Per Stenstrom
Affiliation: Chalmers
Abstract:
The deployment of Large Language Models (LLMs) on client devices is fundamentally constrained by memory bandwidth, particularly during the autoregressive decode phase, where single-user workloads fail to saturate the compute capabilities of modern Neural Processing Units (NPUs). While lossy compression techniques like quantization are widely adopted, they necessitate accuracy trade-offs. Lossless compression, conversely, has seen limited adoption in inference pipelines because of the runtime overhead of decompression and the mismatch between common compression formats and hardware access patterns.
We propose CoCo (**Co**mpile-time **Co**mpression), a compiler-driven framework for lossless weight compression designed for spatial NPU architectures. CoCo leverages the abundance of compute tiles left idle during memory-bound operations to perform software-based decompression, effectively converting wasted cycles into additional memory bandwidth. The approach exploits the static nature of LLM weights and the predictability of decode-phase access patterns to perform format selection and cost modeling offline. By integrating with the MLIR-based IREE compiler, CoCo aligns compression formats with loop tiling and DMA granularities, ensuring that decompression does not stall the execution pipeline.
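The offline format-selection idea can be illustrated with a minimal cost-model sketch. All format names, bandwidth figures, and compression ratios below are illustrative assumptions, not measurements or parts of CoCo; the point is only that, when DMA transfer and on-tile decompression overlap, the slower of the two dominates the cost of streaming a weight tile, and the compiler can pick the cheapest format per tile ahead of time.

```python
# Hypothetical offline cost model for per-tile lossless format selection.
# All constants are illustrative placeholders, not real hardware numbers.

DRAM_BW = 50_000    # assumed DRAM bandwidth, bytes per microsecond (~50 GB/s)
DECOMP_BW = 20_000  # assumed software decompression rate on idle tiles (bytes/us)

# name: (compression ratio, relative decompression cost multiplier)
FORMATS = {
    "raw":     (1.00, 0.0),  # no decompression needed
    "bitpack": (0.75, 0.5),  # cheap fixed-width bit packing
    "lz_like": (0.60, 1.0),  # denser, but costlier to decode
}

def tile_cost_us(tile_bytes: int, ratio: float, decomp_mult: float) -> float:
    """Cost of streaming one tile when DMA transfer overlaps with
    decompression: the slower stage dominates."""
    compressed = tile_bytes * ratio
    transfer = compressed / DRAM_BW
    decompress = compressed * decomp_mult / DECOMP_BW
    return max(transfer, decompress)

def select_format(tile_bytes: int) -> str:
    """Pick the format with the lowest overlapped streaming cost."""
    return min(FORMATS, key=lambda f: tile_cost_us(tile_bytes, *FORMATS[f]))
```

Under these toy numbers, `select_format(65536)` chooses `"bitpack"`: its decompression cost is hidden behind the (shortened) transfer better than the denser but slower-to-decode format. In a real compiler pass, the constants would come from profiled hardware parameters and the decision would be made once per weight tensor at compile time.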