The Next Frontier of AI in Production Is Chaos Engineering
Blast-radius control tells you how much to break. Intent tells you what breaking it will teach. Only one of these has mature tooling.
Here is a question that no chaos engineering tool in production today can answer: did your last experiment test the right thing? Not ‘Did it stay within budget?’ SLO error-budget gating handles that. Not ‘Did the system survive?’ Abort conditions measure that. The question is whether the experiment was designed to validate a specific belief about your system’s behavior, and whether its outcome changed what your team knows about how failure propagates through your stack.
If your honest answer is ‘we terminated some pods, and they recovered,’ you ran a safe experiment. Whether you learned anything useful is a separate question, and current tooling does not ask it. This article makes a concrete argument: chaos engineering has a mature safety layer and an almost nonexistent intent layer. Safety tells you how much to break. Intent tells you what breaking it will teach. These are different design problems requiring different tooling, and conflating them is why chaos programs at scale tend to accumulate scripts without accumulating insight.
The argument is grounded in the architecture I developed and patented (US12242370B2, Intent-Based Chaos Engineering for Distributed Systems), and in observations from practitioners at Intuit, GPTZero, Insurance Panda, Fruzo, and Coders.dev who have independently diagnosed the same structural gap. I will show you the architecture, walk through the data model with code, and explain why this is an AI problem, not just an orchestration problem.
1. The Safety Layer Is Good. It Is Also Incomplete.
Start by giving the current model its due. The SLO error-budget framework, popularized by Google’s SRE practice, gave chaos engineering its first principled safety mechanism. Tying experiment execution to the remaining error budget means you do not inject failure into a system that is already burning through its reliability headroom. AWS Fault Injection Service’s stop conditions, Gremlin’s reliability score, and Harness ChaosGuard’s Rego policies all represent mature, production-ready implementations of this idea.
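To make the gating idea concrete, here is a minimal sketch of an error-budget check. It is not the API of any tool named above; the `SloWindow` shape, the function names, and the 50% threshold are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloWindow:
    """Request counts for one SLO window (hypothetical shape)."""
    slo_target: float      # e.g. 0.999 means 99.9% of requests must succeed
    total_requests: int
    failed_requests: int


def remaining_error_budget(window: SloWindow) -> float:
    """Fraction of the window's error budget still unspent (negative if overspent)."""
    allowed_failures = (1.0 - window.slo_target) * window.total_requests
    if allowed_failures <= 0:
        return 0.0  # no budget exists, so none remains
    return 1.0 - window.failed_requests / allowed_failures


def is_safe_to_run(window: SloWindow, min_budget_fraction: float = 0.5) -> bool:
    """Gate: allow an experiment only if enough budget remains (threshold is an assumption)."""
    return remaining_error_budget(window) >= min_budget_fraction


healthy = SloWindow(slo_target=0.999, total_requests=1_000_000, failed_requests=200)
strained = SloWindow(slo_target=0.999, total_requests=1_000_000, failed_requests=900)
print(is_safe_to_run(healthy), is_safe_to_run(strained))
```

Note what the gate takes as input: counts and a target. Nothing in it describes which experiment is about to run or why.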
These tools answer a well-posed question: given the current state of my system, is it safe to run an experiment right now? The answer is computable, automatable, and reasonably accurate. The question they do not answer is equally important: given the current state of my system, which experiment would be most informative to run right now?
Safety and informativeness are orthogonal. An experiment can satisfy every safety constraint, stay within budget, trigger no aborts, cause no measurable degradation, and still produce nothing useful. If it tested a component that is not in the critical path of any user-facing behavior, you spent budget and learned nothing. If it repeated a failure mode your system has survived a dozen times, without updating your understanding of the propagation path, the result is the same.
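One way to put a number on "informative" is to ask how uncertain the outcome is before the experiment runs. The sketch below is an assumption for illustration, not a method taken from the patent or from any tool: it scores an experiment by the binary entropy of its estimated pass probability, and scores it zero when the target is off the critical path. The function names and the Laplace-smoothed estimate are hypothetical choices.

```python
import math


def estimated_pass_probability(passes: int, failures: int) -> float:
    """Laplace-smoothed pass rate from prior runs of the same experiment."""
    return (passes + 1) / (passes + failures + 2)


def outcome_uncertainty(p_pass: float) -> float:
    """Binary entropy, in bits, of the belief that the system will pass."""
    if p_pass <= 0.0 or p_pass >= 1.0:
        return 0.0
    return -(p_pass * math.log2(p_pass) + (1 - p_pass) * math.log2(1 - p_pass))


def informativeness(passes: int, failures: int, on_critical_path: bool) -> float:
    """Crude proxy: zero off the critical path, otherwise the outcome uncertainty."""
    if not on_critical_path:
        return 0.0
    return outcome_uncertainty(estimated_pass_probability(passes, failures))


# A failure mode the system has survived a dozen times: about 0.37 bits.
repeated = informativeness(passes=12, failures=0, on_critical_path=True)
# A failure mode never tested before: 1.0 bit, the maximum for a pass/fail outcome.
novel = informativeness(passes=0, failures=0, on_critical_path=True)
print(round(repeated, 2), novel)
```

Both experiments could pass the same safety gate. Only the score separates the one worth running, and even this proxy ignores whether the run would change anyone's picture of the propagation path.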
Core distinction: An experiment is safe when it stays within acceptable cost. An experiment is informative when its outcome updates your model of the system’s failure behavior. These require different design criteria, and only the first has mature tooling.

There is a second structural problem. Scripts are frozen at the moment of authorship. They encode assumptions about service topology, traffic patterns, and dependency behavior that may be accurate when written and silently wrong six months later. When a microservice architecture changes weekly, script-to-reality drift accumulates. The script still runs. It tests a world that no longer exists.
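Drift of this kind is detectable if a script's assumptions are written down as data. The following is a minimal sketch under that assumption: dependencies are modeled as (caller, callee) pairs, and the service names are hypothetical. How the observed edges are collected (tracing, a service mesh, a registry) is outside the sketch.

```python
Edge = tuple[str, str]  # (caller, callee)


def detect_drift(assumed: set[Edge], observed: set[Edge]) -> dict[str, set[Edge]]:
    """Compare the dependencies a script assumes with those seen in production."""
    return {
        # Edges the script still targets that no longer exist.
        "stale": assumed - observed,
        # Edges that exist today but that the script never exercises.
        "untested": observed - assumed,
    }


# Assumptions recorded when the script was written (hypothetical services).
assumed_edges = {("checkout", "payments"), ("checkout", "inventory")}
# Topology observed today: inventory calls were moved behind a cache service.
observed_edges = {
    ("checkout", "payments"),
    ("checkout", "cache"),
    ("cache", "inventory"),
}

drift = detect_drift(assumed_edges, observed_edges)
print(sorted(drift["stale"]))
print(sorted(drift["untested"]))
```

A non-empty `stale` set means the script is injecting faults on a path that traffic no longer takes, which is the silent failure described above.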
2. How Practitioners Describe the Ceiling
The following observations were gathered from practitioners via Qwoted, a platform connecting domain experts with researchers and journalists. The responses, from engineers across industries who have built chaos programs in production, converge on the same structural gap from different angles.
Abhishek Pareek, Founder and Director at Coders.dev, builds distributed systems tooling. His framing is the sharpest diagnosis of the problem:

“What we do not have is an understanding of intent-based resiliency. Existing tools are primarily script-based, and we need to create tools that can model the effects of a specific failure on a large number of microservices before executing the experiment. We need AI that understands the reasoning behind the failure in addition to the mechanics of the failure.”
The word ‘reasoning’ is doing real work here. A script captures mechanics: terminate these pods, inject this latency. It does not capture reasoning: we are running this experiment because we believe the checkout circuit breaker should trip before user-facing error rates climb above 0.1%, and we want to know whether it actually does. That reasoning, the hypothesis, is what makes an experiment informative. When it lives only in the engineer’s head, it evaporates as teams and systems change.
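The checkout example can be written down as a record instead of being left in someone's head. The sketch below is illustrative only and is not the data model from the patent; the class names, field names, fault description, and sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Hypothesis:
    """The reasoning behind an experiment, stored next to its mechanics."""
    belief: str                 # what the team expects to happen, in plain words
    fault: str                  # the mechanics: what gets broken
    max_user_error_rate: float  # 0.001 means 0.1%


@dataclass(frozen=True)
class Observation:
    """What was measured during the run; times are seconds after injection."""
    breaker_tripped_at_s: Optional[float]           # None if it never tripped
    error_rate_samples: tuple[tuple[float, float], ...]  # (time, user-facing error rate)


def hypothesis_holds(h: Hypothesis, obs: Observation) -> bool:
    """True if the breaker tripped before the error rate first exceeded the limit."""
    if obs.breaker_tripped_at_s is None:
        return False  # a breaker that never tripped does not confirm the belief
    for t, rate in sorted(obs.error_rate_samples):
        if rate > h.max_user_error_rate:
            return obs.breaker_tripped_at_s < t  # compare against the first breach
    return True  # tripped, and the limit was never breached


h = Hypothesis(
    belief="Checkout circuit breaker trips before user-facing errors exceed 0.1%",
    fault="inject latency on a checkout dependency",  # hypothetical fault
    max_user_error_rate=0.001,
)
run = Observation(
    breaker_tripped_at_s=2.0,
    error_rate_samples=((1.0, 0.0002), (3.0, 0.002)),
)
print(hypothesis_holds(h, run))
```

Because the belief travels with the fault definition, a new team member can see what the experiment was meant to establish, and a result that contradicts the belief is visible as a result rather than as an unexplained incident.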
Edward Tian, Founder and CEO of GPTZero, runs AI inference infrastructure at scale and has developed precise language for what is missing:

“Current chaos tools inject arbitrary points of failure but do not provide any meaningful direction for the user in terms of what they are attempting to validate. The next evolution of chaos will involve targeting specific questions about resiliency, ‘can our systems sustain a degradation in the retrieval of data?’ or ‘are we capable of tolerating a model being unavailable due to a timeout?’, rather than the use of a one-size-fits-all script.”
“Can our systems sustain a degradation in the retrieval of data?” is a behavioral hypothesis. It names a target behavior, a failure condition, and an implicit success criterion. That is more inform…