DeMuon: A Decentralized Muon for Matrix Optimization over Graphs

Dnr:

NAISS 2025/22-1056

Type:

NAISS Small Compute

Principal Investigator:

Shuyi Ren

Affiliation:

Linköpings universitet

Start Date:

2025-11-20

End Date:

2026-01-01

Primary Classification:

10101: Mathematical Analysis

Allocation

Alvis at C3SE: 1000 GPU-h/month
Centre Storage at NSC: 500 GiB
Mimer at C3SE: 500 GiB
Tetralith at NSC: 20 x 1000 core-h/month

Abstract

In this paper, we propose DeMuon, a method for decentralized matrix optimization over a given communication topology. DeMuon incorporates matrix orthogonalization via Newton–Schulz iterations—a technique inherited from its centralized predecessor, Muon—and employs gradient tracking to mitigate heterogeneity among local functions. Under heavy-tailed noise conditions and additional mild assumptions, we establish the iteration complexity of DeMuon for finding an approximate stochastic stationary solution. This complexity result matches the best-known complexity bounds of centralized algorithms in terms of dependence on the target tolerance. To the best of our knowledge, DeMuon is the first direct extension of Muon to decentralized optimization with provable complexity guarantees. We conduct preliminary numerical experiments on decentralized transformer pretraining over communication graphs with varying degrees of connectivity, including complete graphs, directed exponential graphs, and ring graphs. Our numerical results demonstrate a clear margin of improvement of DeMuon over widely used decentralized algorithms across different network topologies.
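To make the orthogonalization step mentioned in the abstract concrete, the following is a minimal sketch of a Newton–Schulz iteration that approximates the orthogonal polar factor of a matrix. It is an illustration under stated assumptions, not the DeMuon implementation: the function name `newton_schulz_orthogonalize`, the iteration count, and the classical cubic update are choices made here, and Muon-style optimizers may use a differently tuned polynomial. Gradient tracking and the communication graph are not shown.

```python
import numpy as np


def newton_schulz_orthogonalize(G, num_iters=30):
    """Approximate U @ V.T, where G = U @ diag(s) @ V.T is a thin SVD.

    Hypothetical helper for illustration; uses the classical cubic
    Newton-Schulz update X <- 1.5 X - 0.5 (X X^T) X.
    """
    # Scale by the Frobenius norm so every singular value lies in (0, 1],
    # which is inside the convergence region of the cubic iteration.
    X = G / (np.linalg.norm(G) + 1e-12)

    # Work with a wide matrix so that X @ X.T is the smaller Gram matrix.
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T

    for _ in range(num_iters):
        # Each singular value s is mapped to 1.5 * s - 0.5 * s**3,
        # which pushes it toward 1 while leaving singular vectors unchanged.
        X = 1.5 * X - 0.5 * (X @ X.T) @ X

    return X.T if transposed else X


# Example: orthogonalize a random stand-in for a gradient matrix.
rng = np.random.default_rng(0)
G = rng.standard_normal((4, 6))
Q = newton_schulz_orthogonalize(G)

# Compare against the exact polar factor computed from an SVD.
U, _, Vt = np.linalg.svd(G, full_matrices=False)
max_error = np.abs(Q - U @ Vt).max()
```

The iteration needs only matrix multiplications, which is why it is attractive on GPUs compared with computing an SVD at every step. Convergence slows when the input has very small singular values relative to its Frobenius norm, so the fixed iteration count here is a convenience for well-conditioned examples rather than a general recommendation.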