Shipping a Trillion Parameters With a Hub Bucket: Delta Weight Sync in TRL

TL;DR, because you have models to train and we respect that: async RL has a dirty secret. Every step, the trainer has to ship the whole model to the inference engine. For a 7B model in bf16 that is 14 GB. For a frontier 1T-parameter checkpoint it is on the order of a terabyte. Per step. It turns out you do not have to.

Between two consecutive RL optimizer steps, roughly 99% of bf16 weights are bit-identical (and never less than 98%, even in the worst case). The actual delta is tiny. We landed a TRL PR that encodes only the changed elements as a sparse safetensors file, uploads it to a Hugging Face Bucket, and tells vLLM to fetch it. On Qwen3-0.6B, the per-step payload drops from 1.2 GB to 20 to 35 MB.
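To make the idea of a sparse delta concrete, here is a minimal sketch in plain Python. It is not the file layout the TRL PR uses: the names `encode_delta` and `apply_delta` are hypothetical, weights are modeled as flat lists of 16-bit bf16 bit patterns, and the safetensors serialization and bucket upload are left out entirely.

```python
from typing import List, Tuple

# A delta is (flat indices of changed elements, their new 16-bit patterns).
# This layout is an assumption for illustration, not TRL's on-disk format.
Delta = Tuple[List[int], List[int]]


def encode_delta(old: List[int], new: List[int]) -> Delta:
    """Keep only the positions whose bf16 bit pattern changed."""
    if len(old) != len(new):
        raise ValueError("old and new must have the same number of elements")
    indices = [i for i, (a, b) in enumerate(zip(old, new)) if a != b]
    values = [new[i] for i in indices]
    return indices, values


def apply_delta(old: List[int], delta: Delta) -> List[int]:
    """Rebuild the new weights from the old weights plus the delta."""
    indices, values = delta
    patched = list(old)  # copy so the caller's weights are not mutated
    for i, v in zip(indices, values):
        patched[i] = v
    return patched


# Four weights as raw bf16 bit patterns; only the element at index 2 changes.
step_n = [0x3D4D, 0x3F80, 0xBE00, 0x3C23]
step_n_plus_1 = [0x3D4D, 0x3F80, 0xBE01, 0x3C23]

delta = encode_delta(step_n, step_n_plus_1)
restored = apply_delta(step_n, delta)
```

Comparing bit patterns rather than float values is deliberate in this sketch: the receiver has to end up with exactly the trainer's weights, so "unchanged" has to mean bit-identical.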

The cherry on top: we ran a fully disaggregated training run where the trainer was on one box, vLLM lived in a Hugging Face Space, the Wordle environment lived in another Space, and weights flowed through a single Hub bucket. No shared cluster, no RDMA, no VPN. Async RL just got a lot cheaper. Read on.

1. The One Terabyte Problem

If you read our previous post on the landscape of async RL training, you already know the punchline. Every async RL library, regardless of how it spells “actor model” or which color its NCCL backend is painted, eventually trips over the same root: weight synchronization. The inference engine speaks the policy of step N. The trainer just finished step N+1. The fresh weights have to get from one side to the other before the inference engine starts drifting hopelessly off-policy.

This sits on the critical path whether you are running sync or async: during a blocking transfer, GPUs sit idle instead of generating tokens. With a sparse delta path you collapse that idle time into seconds, and the trainer does not even have to wait for the inference engine to be ready. The moment its optimizer step finishes, it publishes “weights ready” and uploads the weights to the shared bucket, while the inference engine fetches them on its own schedule.

2. Why bf16 RL Weights Are Almost Always Sparse

Before we wire anything up, it is worth understanding why this whole game is even winnable. The “98% of weights do not change” claim sounds suspiciously like one of those numbers that works in the demo and falls apart in the wild. It is not. It falls out of how bf16 arithmetic works at the learning rates RL uses.

A bf16 number has 7 mantissa bits. Between two consecutive powers of two, there are exactly $2^7=128$ representable values, so the spacing between adjacent bf16 numbers around $|w|$ is roughly $|w| \cdot 2^{-7}$. An update gets absorbed by the bf16 cast whenever it sits below half of that spacing, i.e., when $|\Delta w| < |w|/256$.
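The absorption effect is easy to reproduce without a GPU. The sketch below emulates a bf16 cast in pure Python by rounding the float32 encoding to its top 16 bits with round-to-nearest-even. The helpers `to_bf16` and `bf16_spacing` are illustrative names, not part of any library, and NaN and infinity are not handled. Note that the exact spacing is $2^{e-7}$ with $e = \lfloor \log_2 |w| \rfloor$, which lies between $|w|/256$ and $|w|/128$, so $|w| \cdot 2^{-7}$ is the upper end of that range.

```python
import math
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest bf16 value (ties to even); no NaN/inf handling."""
    # Reinterpret the float32 encoding of x as a 32-bit unsigned integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bf16 keeps the top 16 bits. Adding this bias before truncating
    # implements round-to-nearest-even on the discarded lower 16 bits.
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def bf16_spacing(w: float) -> float:
    """Exact gap between w and the next larger-magnitude bf16 value."""
    _, exp = math.frexp(abs(w))        # abs(w) = m * 2**exp with 0.5 <= m < 1
    return math.ldexp(1.0, exp - 8)    # 2**(floor(log2|w|) - 7)


w = to_bf16(0.05)                      # a typical weight magnitude
spacing = bf16_spacing(w)

absorbed = to_bf16(w + 3e-6)           # far below half the spacing
visible = to_bf16(w + 2e-4)            # above half the spacing

print(f"w={w!r} spacing={spacing!r}")
print(f"w + 3e-6 -> {absorbed!r} (unchanged: {absorbed == w})")
print(f"w + 2e-4 -> {visible!r} (moved one step: {visible == w + spacing})")
```

An update of $3 \times 10^{-6}$ leaves the stored value bit-identical, while an update larger than half the spacing moves it to the next representable number.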

Now look at what Adam does. At an RL learning rate of, say, $3 \times 10^{-6}$, the update to a single weight is: $\Delta w = -\eta \cdot \frac{\hat{m}}{\sqrt{\hat{v}} + \epsilon}$. The normalized step $\hat{m}/(\sqrt{\hat{v}}+\epsilon)$ is of order one, so $|\Delta w| \approx \eta \approx 3 \times 10^{-6}$. For most weights, $|w|$ sits somewhere around $10^{-2}$ to $10^{-1}$. The threshold $|w|/256$ at that magnitude is around $4 \times 10^{-5}$ to $4 \times 10^{-4}$, more than ten times larger than the update. In other words: the optimizer is whispering, and bf16 cannot hear it.
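The same arithmetic can be tabulated for a few weight magnitudes. This is a back-of-the-envelope sketch, assuming the normalized Adam step is exactly 1 so that $|\Delta w| = \eta$, and using the approximate $|w|/256$ threshold from the text. The function name `absorption_threshold` is made up for illustration.

```python
LEARNING_RATE = 3e-6  # RL learning rate used as the example in the text


def absorption_threshold(w_abs: float) -> float:
    """Approximate largest |dw| that the bf16 cast swallows: |w| / 256."""
    return w_abs / 256.0


# Assumption: the normalized Adam step is exactly 1, so |dw| == LEARNING_RATE.
rows = []
for w_abs in (1e-1, 1e-2, 1e-3, 1e-4):
    threshold = absorption_threshold(w_abs)
    absorbed = LEARNING_RATE < threshold
    rows.append((w_abs, threshold, absorbed))
    print(f"|w|={w_abs:.0e}  threshold={threshold:.2e}  absorbed={absorbed}")

# Weight magnitude below which an update of this size starts to register.
crossover = 256.0 * LEARNING_RATE
print(f"updates become visible for |w| below about {crossover:.2e}")
```

Under these assumptions, only weights with magnitude below roughly $7.7 \times 10^{-4}$ can change in a single step, which fits the picture of a small minority of elements making up the delta.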