SignMuon: Communication-Efficient Distributed Muon Optimization

Abstract: Distributed training of large neural networks is bottlenecked by full-precision gradient communication and by coordinatewise optimizers that ignore the matrix structure of weight tensors. We propose Sign-Muon, a 1-bit, matrix-aware optimizer that combines majority-vote sign aggregation from signSGD with the polar-step framework of Muon.

Each worker forms a Muon-style direction by computing the polar factor of its momentum with a Newton-Schulz iteration and transmits only the entrywise signs, which are then aggregated by majority vote; an optional local polar step further enforces orthogonality at no extra communication cost.
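
The per-iteration procedure described above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the cubic Newton-Schulz coefficients, the heavy-ball momentum form, and the names `newton_schulz_polar`, `majority_vote`, and `sign_muon_step` are all hypothetical choices made for this example. The in-process sum stands in for the integer sum-allreduce, and the spectral-norm normalization of the step is omitted.

```python
import numpy as np


def newton_schulz_polar(G, steps=5, eps=1e-7):
    """Approximate the polar factor U V^T of G (where G = U S V^T).

    Uses the classical cubic map X <- 1.5 X - 0.5 X X^T X, which is an
    assumption of this sketch; other coefficient choices are possible.
    """
    # The Frobenius norm upper-bounds the spectral norm, so all singular
    # values of X start in [0, 1], inside the iteration's convergence region.
    X = G / (np.linalg.norm(G) + eps)
    transpose = X.shape[0] > X.shape[1]
    if transpose:  # work in the wide orientation so X @ X.T is the smaller Gram matrix
        X = X.T
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ X.T) @ X
    return X.T if transpose else X


def majority_vote(sign_list):
    """Sum the workers' sign matrices and take the entrywise sign.

    Ties (possible with an even number of workers) yield 0 for that entry.
    """
    total = np.sum(np.stack(sign_list).astype(np.int32), axis=0)
    return np.sign(total).astype(np.int8)


def sign_muon_step(W, momenta, grads, lr=0.01, beta=0.9, local_polar=False):
    """One simulated step over all workers; returns the new weights and momenta."""
    new_momenta, signs = [], []
    for m, g in zip(momenta, grads):
        m = beta * m + g  # heavy-ball momentum (assumed form)
        new_momenta.append(m)
        # Only these entrywise signs would be communicated.
        signs.append(np.sign(newton_schulz_polar(m)).astype(np.int8))
    vote = majority_vote(signs).astype(W.dtype)
    if local_polar:  # optional local polar step on the aggregated vote
        vote = newton_schulz_polar(vote)
    return W - lr * vote, new_momenta
```

With an odd number of workers the vote contains no ties, so every entry of the update has magnitude `lr` unless `local_polar=True` rescales it.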

Under spectral-norm smoothness and bounded-variance stochastic gradients, the spectral-norm normalized sign step yields an $\mathcal{O}(1/\sqrt{T})$ nonconvex rate for an $\ell_1$-based stationarity measure. With unimodal symmetric noise, majority vote across $M$ workers reduces the stochastic term by a factor of $\sqrt{M}$, matching signSGD.
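
To give some intuition for why majority vote helps, the following sketch computes how often the vote gets a single coordinate's sign wrong. It illustrates the mechanism only and is not the paper's analysis. It assumes independent Gaussian noise (one example of a unimodal symmetric distribution) and an odd number of workers; the function names are hypothetical.

```python
from math import comb, erfc, sqrt


def single_sign_error(snr):
    """P(sign(g + noise) != sign(g)) for Gaussian noise, with snr = |g| / sigma."""
    return 0.5 * erfc(snr / sqrt(2.0))


def majority_vote_error(num_workers, p):
    """P(the majority of independent workers is wrong), each wrong with probability p.

    num_workers is assumed odd so that no ties occur.
    """
    need = num_workers // 2 + 1  # wrong votes required to flip the majority
    return sum(
        comb(num_workers, k) * p**k * (1 - p) ** (num_workers - k)
        for k in range(need, num_workers + 1)
    )


p = single_sign_error(0.5)  # noise standard deviation is twice the gradient magnitude
errors = {m: majority_vote_error(m, p) for m in (1, 5, 25)}
for m, err in errors.items():
    print(f"M={m:2d}  P(wrong sign after vote) = {err:.4f}")
```

Because each worker is wrong with probability below one half, the probability of a wrong vote shrinks as $M$ grows.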

In the $\alpha$-$\beta$ model, distributed Sign-Muon needs only one integer sum-allreduce per iteration; all orthogonalization is local, giving a $32\times$ bandwidth reduction over float32 ($4\times$ for int8). Across 330 CIFAR-10/ResNet-50 configurations, Sign-Muon attains the best validation accuracy (92.15%); its 4-GPU majority-vote variant reaches 92.02% with 37% less training time at a matched effective batch size.
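
The bandwidth figures can be checked with simple arithmetic on the payload per entry. The sketch below also estimates allreduce time under the $\alpha$-$\beta$ model using the standard ring-allreduce cost, $2(M-1)\alpha + \frac{2(M-1)}{M} n \beta$ for $n$ bytes. The choice of a ring algorithm, the parameter count, and the values of `alpha` and `beta` are illustrative assumptions; none of them come from the paper.

```python
def payload_bytes(num_params, bits_per_entry):
    """Bytes needed to send one value per parameter at the given bit width."""
    return num_params * bits_per_entry / 8


def ring_allreduce_time(num_bytes, workers, alpha, beta):
    """Alpha-beta cost of a ring allreduce: latency term plus bandwidth term."""
    steps = 2 * (workers - 1)  # reduce-scatter followed by all-gather
    return steps * alpha + (steps / workers) * num_bytes * beta


NUM_PARAMS = 25_000_000  # hypothetical round number, roughly ResNet-50 scale
ALPHA = 5e-6             # hypothetical per-message latency in seconds
BETA = 1e-10             # hypothetical seconds per byte (about 10 GB/s)

fp32 = payload_bytes(NUM_PARAMS, 32)
for name, bits in (("float32", 32), ("int8", 8), ("1-bit", 1)):
    size = payload_bytes(NUM_PARAMS, bits)
    t = ring_allreduce_time(size, workers=4, alpha=ALPHA, beta=BETA)
    print(f"{name:8s} {size / 1e6:8.2f} MB  ratio vs float32 = {fp32 / size:4.1f}  time = {t * 1e3:.3f} ms")
```

The ratio column depends only on the bit widths, whereas the time column shifts with the assumed `ALPHA` and `BETA`.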

On nanoGPT, Sign-Muon achieves lower perplexity and better anytime performance than other sign-based baselines, with favorable weak scaling up to 16 GPUs.
