Lights Out, Systems On: Validating Instant Power Loss Readiness

By Raghu Prabhu, Richard Johnson

We’re introducing Instantaneous PowerLoss Storm, a new testing paradigm within Meta’s infrastructure for handling and mitigating instant or zero-notice power loss in our data centers. We’re sharing how we built readiness to tolerate instant failures into our existing systems with defense-in-depth strategies, the tradeoffs we made in implementing it, and how we validated our readiness.

Disaster preparedness is not optional. Hurricanes, wildfires, power supply and network disruptions, and countless other disaster scenarios all pose risks to our data center (DC) infrastructure. Early warning systems and tried-and-tested mitigation strategies already serve us well in situations where we have a few hours or more of advance warning. These strategies have matured over time as we have expanded our DC presence, but the ever-increasing size and variety of our infrastructure demands a higher level of preparedness for zero-notice disasters (ones that occur without any warning), such as instantaneous power loss, while keeping the impact on overall fleet availability minimal.

Instantaneous PowerLoss Storm is a new testing paradigm within Meta’s long-established Disaster Readiness (DR) “Storm” program. It forms the last line of defense, and the ultimate safety net, for handling and mitigating instant or zero-notice power loss from known, emerging, and unknown risks.

How We Built Readiness To Tolerate Instant Failures Into Our Existing Systems With Defense-in-Depth Strategies

The capability to handle instant power loss had to be built from the ground up into our DC stack, from mechanical and electrical facilities to server racks, and from storage to compute and the core Twine container orchestrator. Fortunately, each of these architectures was already developed with power loss tolerance as an integral component. One such capability is persisting in-memory data when racks have lost power, using batteries and the Power Loss Siren (PLS). Another is a robust, region-wide asynchronous signaling mechanism for Twine services in the form of unavailability events (UEs). (A DC region, referred to as a “region” below, is one where multiple DC buildings are co-located and share common network and power connectivity.)

While these capabilities were battle-tested and hardened on single fault domains within individual DCs, we identified outstanding vulnerabilities in scenarios encompassing an entire region. Testing a region also required us to confront problems not only of scale (a typical region is 50-60x the size of a typical fault domain) and replica placement, but also of autonomous bootstrapping. Bootstrapping refers to kickstarting a powered-off region, which requires millions of services to start all at once and discover each other autonomously.

Below we describe two of the bootstrapping problems we encountered, both of which required a belt-and-braces approach to cover all eventualities. The most prominent one, which has haunted us from our earliest days, is dependencies, and in particular the dreaded circular dependency, or “ouroboros,” risk. Our Twine orchestrator has a set of control plane services, including Scheduler, Allocator, Broker, and Zelos (the coordinator), without which we cannot run or start any other services in the region. While the risk from circular dependencies during regular operations is low, the risk and impact are far higher when bootstrapping an entire region. It’s a true chicken-and-egg problem.

We solved this by identifying critical startup dependencies among the control plane services and continuously checking for them, early and often, with Belljar tests in our CI/CD pipelines. These tests helped uncover and eliminate most, if not all, dependency risks before they were deployed to production. Given the rapid evolution of our infrastructure, we also needed, as a belt-and-braces measure, the capability to break any circular dependencies that might still occur unexpectedly. A purpose-built Twine recovery kit (Twrko) provides this “jumpstart” capability to recover the Twine services that power Twine itself. Together, Belljar and Twrko have allowed us to put the specter of circular dependencies to rest.
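To make the circular dependency risk concrete, here is a minimal sketch of how a cycle among startup dependencies could be detected with a depth-first search. It is not how Belljar works internally, which the text above does not describe. The function name `find_startup_cycle` and the dependency edges in the usage example are hypothetical, chosen only for illustration; only the service names come from the article.

```python
from typing import Dict, List, Optional


def find_startup_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one startup dependency cycle as a list of names, or None.

    `deps` maps a service to the services it needs in order to start.
    This is an illustrative helper, not part of any Meta tooling.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color: Dict[str, int] = {}
    for service, needed in deps.items():
        color.setdefault(service, WHITE)
        for dep in needed:
            color.setdefault(dep, WHITE)

    path: List[str] = []  # services on the current DFS path, in order

    def visit(service: str) -> Optional[List[str]]:
        color[service] = GREY
        path.append(service)
        for dep in deps.get(service, []):
            if color[dep] == GREY:
                # dep is already on the current path, so we closed a loop.
                return path[path.index(dep):] + [dep]
            if color[dep] == WHITE:
                cycle = visit(dep)
                if cycle is not None:
                    return cycle
        path.pop()
        color[service] = BLACK
        return None

    for service in list(color):
        if color[service] == WHITE:
            cycle = visit(service)
            if cycle is not None:
                return cycle
    return None


# Hypothetical edges: the real dependencies between these services
# are not given in the article.
healthy = {
    "scheduler": ["allocator"],
    "allocator": ["broker"],
    "broker": ["zelos"],
    "zelos": [],
}
broken = dict(healthy, zelos=["scheduler"])

healthy_cycle = find_startup_cycle(healthy)
broken_cycle = find_startup_cycle(broken)
```

With the hypothetical `broken` graph, the function reports a loop that starts and ends at the same service, which is the kind of finding a pre-production check would need to flag before a regional bootstrap exposes it.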

We also encountered a related “boomerang” problem: the generator of a critical signal being impacted by that same signal. The UEs used to orchestrate the shutdown and recovery of services ended up shutting down the orchestrator’s control plane services themselves, resulting in orphaned services that could not be “reaped” (because they never received a UE). While this could have been solved with intricate solutions, such as excluding a preset list of services from the UE dispatch list, we adopted a simpler and more sustainable approach: control plane services simply “ignore” shutdown signals associated with power-related UEs.

The boomerang effect: The shutdown of Service-Z indirectly impacts the Twine Scheduler’s ability to orchestrate shutdowns.
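The “ignore” approach can be sketched as a check inside each service’s event handler, rather than as an exclusion list maintained by the dispatcher. The example below is a simplified model under stated assumptions: the `UECause`, `UnavailabilityEvent`, and `Service` types, their fields, and the action strings are all hypothetical and do not reflect Twine’s actual UE API.

```python
from dataclasses import dataclass
from enum import Enum


class UECause(Enum):
    """Hypothetical UE causes; the real taxonomy is not given in the article."""
    POWER = "power"
    MAINTENANCE = "maintenance"


@dataclass(frozen=True)
class UnavailabilityEvent:
    cause: UECause
    action: str  # assumed to be "shutdown" or "recover"


@dataclass
class Service:
    name: str
    is_control_plane: bool = False
    running: bool = True

    def handle(self, event: UnavailabilityEvent) -> bool:
        """Apply the event; return True if the service acted on it."""
        if event.action not in ("shutdown", "recover"):
            raise ValueError(f"unknown action: {event.action}")
        # Control plane services ignore power-related shutdown signals so
        # they stay up to orchestrate the shutdown of everything else.
        if (
            event.action == "shutdown"
            and event.cause is UECause.POWER
            and self.is_control_plane
        ):
            return False
        self.running = event.action == "recover"
        return True


power_shutdown = UnavailabilityEvent(UECause.POWER, "shutdown")

scheduler = Service("scheduler", is_control_plane=True)
service_z = Service("service-z")

scheduler_acted = scheduler.handle(power_shutdown)
service_z_acted = service_z.handle(power_shutdown)
```

In this model the dispatcher sends every UE to every service, and the decision to ignore lives with the receiver. That avoids keeping a separate exclusion list in sync as control plane services are added or renamed, which matches the article’s stated preference for the simpler approach.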

Tradeoffs Made When Striking the Right Balance Between Reliability and Velocity of Growth

While it is feasible to build watertight tolerance to instant power loss, doing so can incur opportunity costs for infra or risk overengineering our systems. Overengineering can even introduce the risk of false positives that impact regular operations. Hence, we needed to make certain tradeoffs to strike the right balance between reliability and engineering. We began by drawing the line on which impacts must be avoided. Data loss in storage and database systems, permanent damage to DC facilities (mechanical/electrical), and sustained impact beyond a single region are some that we prominently noted as table-stakes requirements. Transient service errors, rack failures (within a predefined threshold), and bounded staleness in service routing tables or in region unavailability detection (a hard problem for asynchronous systems) were…