Decoupled DiLoCo: A new frontier for resilient, distributed AI training
Our new distributed architecture helps train LLMs across distant data centers, with lower bandwidth requirements and greater hardware resiliency.
Training a frontier AI model traditionally depends on a large, tightly coupled system in which identical chips must stay in near-perfect synchronization. This approach is highly effective for today’s state-of-the-art models, but as we look toward future generations of models at even greater scale, maintaining this level of synchronization across thousands of chips becomes a significant logistical challenge.
Today, in a new paper we are excited to share a new approach to this problem, called Decoupled DiLoCo (Distributed Low-Communication). By dividing large training runs across decoupled “islands” of compute, with asynchronous data flowing between them, this architecture isolates local disruptions so that other parts of the system can keep learning efficiently.
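To make the division of labor concrete, here is a minimal sketch of a DiLoCo-style two-level loop on a toy quadratic problem: each island takes many cheap local optimizer steps, and only a small parameter delta is exchanged between islands at long intervals. All names and constants here (H, inner_lr, outer_lr, local_grad, the heavy-ball outer momentum) are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of an islands-of-compute training loop (toy problem).
import numpy as np

rng = np.random.default_rng(0)
num_islands, H, inner_lr, outer_lr, momentum = 4, 50, 0.05, 0.7, 0.9

theta = rng.normal(size=8)                 # globally shared parameters
velocity = np.zeros_like(theta)            # outer-optimizer momentum buffer
targets = [rng.normal(size=8) for _ in range(num_islands)]  # stand-in for per-island data

def local_grad(w, target):
    """Gradient of a toy quadratic loss 0.5 * ||w - target||^2."""
    return w - target

for outer_step in range(20):
    deltas = []
    for island in range(num_islands):
        w = theta.copy()                   # each island starts from the shared point
        for _ in range(H):                 # H cheap local steps, no communication
            w -= inner_lr * local_grad(w, targets[island])
        deltas.append(theta - w)           # this island's "outer gradient"
    # Only this small averaged delta crosses the wide-area network,
    # once every H steps, instead of a full gradient every step.
    outer_grad = np.mean(deltas, axis=0)
    velocity = momentum * velocity + outer_grad
    theta -= outer_lr * velocity

print("distance to consensus:", np.linalg.norm(theta - np.mean(targets, axis=0)))
```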
The result is a more resilient and flexible way to train advanced models across globally distributed data centers. And crucially, Decoupled DiLoCo does not suffer from the communication delays that made earlier distributed methods, such as data-parallel training, impractical at global scale.
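The bandwidth gap is easy to estimate with back-of-the-envelope arithmetic. The numbers below (bf16 weights, one inner step per second, a synchronization period of 500 steps) are hypothetical, chosen only to show why syncing every H steps instead of every step changes what a wide-area link can support:

```python
# Hedged estimate of cross-datacenter traffic per worker for a
# 12B-parameter model; step counts and H are illustrative assumptions.
params = 12e9
bytes_per_param = 2                              # bf16
payload_gb = params * bytes_per_param / 1e9      # ~24 GB per exchange

steps_per_hour = 3600                            # hypothetical: 1 inner step/second
H = 500                                          # hypothetical synchronization period

dp_traffic = payload_gb * steps_per_hour         # data-parallel: exchange every step
diloco_traffic = payload_gb * steps_per_hour / H # DiLoCo-style: every H steps

print(f"data-parallel: {dp_traffic:,.0f} GB/hour")    # ~86,400 GB/hour
print(f"low-comm:      {diloco_traffic:,.0f} GB/hour") # ~173 GB/hour
```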
As frontier models continue to grow in scale and complexity, we’re exploring diverse approaches to train models across more compute, locations and varied hardware.
Figure 1: Decoupling training runs into separate “islands” of compute (learner units) allows largely uninterrupted training despite the same level of hardware failures, because the effects of those failures are isolated.
Developing more fault-tolerant asynchronous training at scale
Decoupled DiLoCo builds on two earlier advances: Pathways, which introduced a distributed AI system based on asynchronous data flow, and DiLoCo, which dramatically reduced the bandwidth required between distributed data centers, making it practical to train large language models across distant locations.
Decoupled DiLoCo brings those ideas together to train AI models more flexibly at scale. Built on top of Pathways, it enables asynchronous training across separate islands of compute (known as learner units) so that a chip failure in one area doesn’t interrupt the progress of the others.
This infrastructure is also self-healing. In testing, we used a method called “chaos engineering” to introduce artificial hardware failures during training runs. Decoupled DiLoCo continued the training process after the loss of entire learner units, and then seamlessly reintegrated them when they came back online.
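In spirit, the failure handling can be pictured as in the sketch below (not the production system): whole learner units randomly drop out of an outer round, the global update averages whichever deltas actually arrived, and a recovered unit rejoins simply by pulling the current parameters. The failure rate and other constants are invented for illustration.

```python
# Sketch of failure-tolerant outer updates, extending the toy loop above.
import numpy as np

rng = np.random.default_rng(1)
num_islands, H, inner_lr, outer_lr = 4, 50, 0.05, 0.5
theta = rng.normal(size=8)
targets = [rng.normal(size=8) for _ in range(num_islands)]

for outer_round in range(20):
    deltas = []
    for island in range(num_islands):
        if rng.random() < 0.25:            # "chaos": this learner unit fails this round
            continue                       # only its own work is lost; others proceed
        w = theta.copy()
        for _ in range(H):                 # local steps, as in the sketch above
            w -= inner_lr * (w - targets[island])
        deltas.append(theta - w)
    if deltas:                             # update with whichever units survived
        theta -= outer_lr * np.mean(deltas, axis=0)
    # Reintegration is trivial: a unit that comes back online just reads
    # the current theta at the start of the next round and contributes again.
```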
Testing Decoupled DiLoCo with Gemma 4 models demonstrated that, when hardware fails, the system maintains greater availability of learning clusters than more traditional training methods — while ultimately delivering the same benchmarked level of machine learning (ML) performance.
Figure 2: Left: The Decoupled DiLoCo approach requires orders of magnitude less bandwidth than conventional training methods, making it very efficient. Middle: With increasing levels of hardware failure, Decoupled DiLoCo continues to deliver a high level of “goodput”, or useful training, while that of other approaches nosedives. (The first two charts are based on simulated training runs). Right: In real-world experiments, the benchmarked ML performance of Gemma 4 models trained using Decoupled DiLoCo equalled the performance attained with conventional training approaches.
Decoupled DiLoCo is not only more resilient to failures, but also practical for executing production-level, fully distributed pre-training. We successfully trained a 12-billion-parameter model across four separate U.S. regions using 2-5 Gbps of wide-area networking, a level achievable over existing internet connectivity between data center facilities, without requiring new custom network infrastructure. Notably, the system achieved this training result more than 20 times faster than conventional synchronization methods. This is because our system folds the required communication into longer periods of computation, avoiding the “blocking” bottlenecks where one part of the system must wait for another.
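One way to picture how communication hides inside computation: apply each round’s averaged delta one round late, so the transfer over the slow link happens while the islands are already busy computing the next round from slightly stale parameters. This is a minimal illustrative sketch of that pattern, not the exact scheme from the paper; the in_flight buffer and all constants are assumptions.

```python
# Sketch of non-blocking outer updates: compute overlaps communication.
import numpy as np

rng = np.random.default_rng(2)
num_islands, H, inner_lr, outer_lr = 4, 50, 0.05, 0.5
theta = rng.normal(size=8)
targets = [rng.normal(size=8) for _ in range(num_islands)]
in_flight = None                           # averaged delta still crossing the WAN

for outer_round in range(20):
    anchor = theta.copy()                  # islands start from a slightly stale point
    deltas = []
    for island in range(num_islands):      # all of this compute overlaps the transfer
        w = anchor.copy()
        for _ in range(H):
            w -= inner_lr * (w - targets[island])
        deltas.append(anchor - w)
    if in_flight is not None:              # last round's delta has now arrived;
        theta -= outer_lr * in_flight      # apply it one round late, with no blocking
    in_flight = np.mean(deltas, axis=0)    # hand this round's delta to the network
```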
Driving the evolution of AI training infrastructure
At Google, we take a full-stack approach to AI training, spanning hardware, software infrastructure and research. Increasingly, gains are coming from rethinking how these layers fit together.
Decoupled DiLoCo is one example. By enabling training jobs to run over internet-scale bandwidth, it can tap any unused compute wherever it sits, turning stranded resources into useful capacity.
Beyond efficiency and resilience, this training paradigm also unlocks the ability to mix different hardware generations, such as TPU v6e and TPU v5p, in a single training run. This approach not only extends the useful life of existing hardware…