Capacity Efficiency at Meta: How Unified AI Agents Optimize Performance at Hyperscale
By Tommy Tran, Michael Zetune
We’re sharing insights into Meta’s Capacity Efficiency Program, where we’ve built an AI agent platform that helps automate finding and fixing performance issues throughout our infrastructure. By leveraging encoded domain expertise through a unified, standardized tool interface, these agents help save power and free engineers from addressing performance issues so they can focus on innovating on new products.
We’ve built a unified AI agent platform that encodes the domain expertise of senior efficiency engineers into reusable, composable skills. These agents now automate both finding and fixing performance issues, recovering hundreds of megawatts (MW) of power and compressing hours of manual regression investigation into minutes, enabling the program to scale MW delivery across a growing number of product areas without proportionally scaling headcount.
On defense, FBDetect, Meta’s in-house regression detection tool, catches thousands of regressions weekly; faster automated resolution means fewer megawatts wasted compounding across the fleet. On offense, AI-assisted opportunity resolution is expanding to more product areas every half (Meta’s six-month planning cycle), handling a growing volume of wins that engineers would never get to manually. Together, this is how Meta’s Capacity Efficiency Program keeps growing MW delivery without proportionally growing the team. The end goal is a self-sustaining efficiency engine where AI handles the long tail.
Here’s how it works and where we’re headed: Efficiency at hyperscale requires both offense (proactively finding optimizations) and defense (catching and mitigating regressions that make it to production); AI can accelerate both. We’ve built a unified platform where standardized tool interfaces combine with encoded domain expertise to automate investigation on both sides. These AI systems are now the infrastructure for the Capacity Efficiency program, which has recovered hundreds of megawatts of power, enough to power hundreds of thousands of American homes for a year. Automating diagnoses can compress ~10 hours of manual investigation into ~30 minutes, while AI agents fully automate the path from efficiency opportunity to ready-to-review pull request.
Introducing the Capacity Efficiency Program
When the code you ship serves more than 3 billion people, even a 0.1% performance regression can translate to significant additional power consumption. In Meta’s Capacity Efficiency organization, we see efficiency as a two-sided effort:
- Offense: searching for opportunities (proactive code changes) to make our existing systems more efficient, and deploying them.
- Defense: monitoring resource usage in production to detect regressions, root-cause them to a pull request, and deploy mitigations.
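To give a sense of scale for the 0.1% figure above, here is a back-of-envelope sketch. Every number in it (server count, per-server wattage) is a hypothetical illustration, not a Meta figure:

```python
# Back-of-envelope sketch: how a small fleet-wide regression translates
# into power draw. All inputs below are hypothetical, for illustration only.

FLEET_SERVERS = 1_000_000        # assumed server count, not a real figure
AVG_WATTS_PER_SERVER = 400       # assumed average draw per server
REGRESSION_FRACTION = 0.001      # a 0.1% increase in compute demand

# If serving the same load now needs 0.1% more compute, the fleet must
# supply roughly that much more power (ignoring cooling overhead, PUE, etc.).
extra_watts = FLEET_SERVERS * AVG_WATTS_PER_SERVER * REGRESSION_FRACTION
print(f"{extra_watts / 1e6:.1f} MW")  # prints 0.4 MW for these assumed inputs
```

Even under these conservative assumptions, a single 0.1% regression left in place costs a measurable fraction of a megawatt, continuously, until it is fixed.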
These systems have worked well and played an important role in Meta’s efficiency efforts for years. However, actually resolving the issues they surface introduces a new bottleneck: human engineering time. That time is spent on activities such as:
- Querying profiling data to find opportunities to optimize hot functions.
- Reviewing an efficiency opportunity’s description, documentation, and past examples to understand the best approach for implementing an optimization.
- Checking recent code and configuration deployments that could have caused a step change in resource usage.
- Looking through recent internal discussions about launches that might have been related to a regression.
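The first activity above, mining profiling data for hot functions, can be sketched as a simple aggregation over sampled stack data. The sample format and function names here are hypothetical illustrations, not Meta’s actual profiler output:

```python
from collections import Counter

# Hypothetical profiler samples: (function_name, cpu_samples) pairs, as a
# sampling profiler might export them over some time window.
samples = [
    ("serialize_response", 4200),
    ("fetch_user_graph", 2900),
    ("render_feed", 1800),
    ("log_metrics", 350),
]

def hot_functions(samples, top_n=3):
    """Rank functions by CPU share to surface optimization candidates."""
    totals = Counter()
    for func, cpu in samples:
        totals[func] += cpu
    total_cpu = sum(totals.values())
    return [(func, cpu / total_cpu) for func, cpu in totals.most_common(top_n)]

for func, share in hot_functions(samples):
    print(f"{func}: {share:.1%} of sampled CPU")
```

The ranking itself is trivial; the engineering time goes into deciding which of the top functions are actually optimizable, which is exactly the judgment the skills described below try to encode.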
Many engineers at Meta use our efficiency tools to work on these problems every day. But no matter how high-quality the tooling is, engineers have limited time to address performance issues when innovating on new products is our top priority. We started asking: What if AI could handle investigation and resolution?
Offense and Defense Share the Same Structure
The breakthrough was realizing that both problems share the same structure, which meant we didn’t need two separate AI systems. We needed one platform that could serve both. We built it on two layers:
- MCP Tools: These are standardized interfaces for LLMs to invoke code. Each tool does one thing: query profiling data, fetch experiment results, retrieve configuration history, search code, or extract documentation.
- Skills: These encode domain expertise about performance efficiency. A skill can tell an LLM which tools to use and how to interpret results. It captures reasoning patterns that experienced engineers developed over years, such as “consult the top GraphQL endpoints for endpoint latency regressions” or “look for recent schema changes if the affected function handles serialization.”
Together, tools and skills elevate a general-purpose language model into a system that can apply the domain expertise typically held by senior engineers. The same tools can power both offense and defense. Only the skills differ.
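The two layers can be sketched as follows. This is a minimal illustration, not Meta’s actual MCP implementation: the registry, tool bodies, and skill text below are all hypothetical, though the tool names mirror the list above.

```python
# Sketch of the two layers: single-purpose tools exposed through one
# standardized interface, plus a skill that tells the model how to use them.
# Registry, tool bodies, and skill text are hypothetical illustrations.

TOOLS = {}

def tool(name):
    """Register a function under a standardized, discoverable tool name."""
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register

@tool("query_profiling_data")
def query_profiling_data(function_name: str) -> dict:
    # A real tool would query a profiling backend; stubbed for illustration.
    return {"function": function_name, "cpu_share": 0.12}

@tool("retrieve_config_history")
def retrieve_config_history(service: str) -> list:
    return [{"service": service, "change": "rollout of new cache policy"}]

# A skill is encoded expertise: which tools to call, in what order, and how
# to read the results. The model receives it as guidance text.
ENDPOINT_LATENCY_SKILL = """
For endpoint latency regressions:
1. Call query_profiling_data on the regressing endpoint's top functions.
2. If the hot function handles serialization, call retrieve_config_history
   and look for recent schema or cache-policy changes.
"""

print(sorted(TOOLS))  # the standardized surface an LLM agent can invoke
```

The design benefit is separation of concerns: tools stay small and testable, while the judgment about when and how to combine them lives in skill text that domain experts can update without touching code.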
Defense: Catching Regressions Before They Compound
FBDetect is Meta’s in-house regression detection tool, which can catch performance regressions as small as 0.005% in noisy production environments by analyzing time-series data. When FBDetect finds a regression, we immediately attempt to root-cause it to a code or configuration change; this is a vital first step in understanding what happened. Root-causing is done primarily with traditional techniques, such as correlating regressing functions with recent pull requests. Once a root cause is determined, engineers are notified and expected to take action, such as optimizing the recent code change.
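The core idea of finding a step change in a noisy series can be illustrated with a toy detector that compares windowed means. FBDetect’s production techniques are far more sophisticated (and far more sensitive); this sketch, with made-up data, only shows the shape of the problem:

```python
# Toy step-change detection: compare the mean of a trailing window against
# the mean of everything before it. Purely illustrative; real detection at
# 0.005% sensitivity requires much more careful statistics.
import random
import statistics

random.seed(7)
baseline = [100.0 + random.gauss(0, 0.5) for _ in range(200)]
regressed = [100.8 + random.gauss(0, 0.5) for _ in range(200)]  # ~0.8% step
series = baseline + regressed

def step_change(series, window=200):
    """Relative shift between the trailing window and the preceding data."""
    before = statistics.mean(series[:-window])
    after = statistics.mean(series[-window:])
    return (after - before) / before

shift = step_change(series)
print(f"detected shift: {shift:.3%}")
```

Even in this toy, the injected 0.8% step is smaller than the per-sample noise; only averaging over many samples makes it visible, which is why regression detection in noisy production data is a statistics problem rather than a thresholding one.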
We’ve added a feature to make this faster: the AI Regression Solver.
Our AI Regression Solver is the newest and most promising component of FBDetect; it produces a pull request that fixes forward the regression automatically. Traditionally, the root causes (pull requests) that created performance regressions were either rolled back (slowing engineering velocity) or ignored (unnecessarily increasing infrastructure resource use). Now, our in-house coding agent is activated to…