A green test suite proves less than you think
A green test suite proves less than you think
绿色的测试套件证明的东西比你想象的要少
The test that scared me was the one that passed. I had an integration test for a routing agent, the kind that takes a task and picks a capability to handle it. The test registered a new capability at runtime and then checked that the router would eventually route to it. Green run after run. Solid. I trusted it. 最让我后怕的测试,反而是那个通过了的测试。我曾为一个路由代理编写过集成测试,这种代理负责接收任务并选择相应的能力来处理它。测试在运行时注册了一个新能力,然后检查路由器最终是否会将其路由到该能力。测试一次又一次地显示绿色通过。非常稳固。我信任它。
Then I read it properly. It reused the same task string on every iteration of the loop. My scorer was deterministic by design, it hashed the task and indexed into the capability list, so a fixed string mapped to a fixed slot, and the newly registered capability lived at a different slot that the fixed string could never reach. The test asserted that the new capability got selected. The new capability was structurally unreachable. And the assertion passed anyway, because the loop happened to land on something registered every time, which was all the weak version of the check actually demanded. 后来我仔细阅读了代码,发现它在循环的每次迭代中都重复使用了同一个任务字符串。我的评分器在设计上是确定性的:它对任务进行哈希处理并索引到能力列表中,因此固定的字符串总是映射到固定的槽位,而新注册的能力位于另一个槽位,该固定字符串永远无法触及。测试断言新能力被选中了,但实际上新能力在结构上是不可达的。然而断言还是通过了,因为循环恰好每次都落在了某个已注册的项上,而这正是那个薄弱的检查逻辑所要求的全部。
The test was not testing what it said it was testing. It was green for a reason that had nothing to do with the thing I cared about. The fix was almost insulting in its smallness, vary the task strings so the hash spreads across every slot including the new one, and suddenly the test could fail when the feature was broken, which is the entire point of a test. One line. I had been shipping false confidence behind a checkmark. 这个测试并没有测试它声称要测试的内容。它显示绿色通过的原因,与我真正关心的逻辑毫无关系。修复方案简单得近乎侮辱人:只需改变任务字符串,让哈希值分布到包括新槽位在内的每一个位置,测试就能在功能损坏时报错——这才是测试的全部意义所在。仅仅一行代码。我一直以来都在用一个虚假的“对勾”传递着虚假的信心。
That is the moment this whole piece is about. Not the bug. The checkmark. The number that lies. Here is the setup that produces this every time. You build an agent. You write unit tests. You watch line coverage climb to ninety-something percent. CI turns green. You deploy. And within a week the thing is making nonsensical decisions under load, falling over on inputs you never imagined a user would send, and getting stuck in loops your loop detector cannot see because two threads stepped on each other’s state at the same instant. 这就是本文想要探讨的时刻。不是关于那个 Bug,而是关于那个“对勾”。那个会撒谎的数字。以下是导致这种情况的典型流程:你构建了一个代理,编写了单元测试,看着代码覆盖率攀升到百分之九十多,CI 显示绿色,然后你部署了。结果不到一周,系统在高负载下做出荒谬的决策,在用户意想不到的输入下崩溃,并陷入循环检测器无法察觉的死循环中,原因仅仅是两个线程在同一瞬间干扰了彼此的状态。
The unit tests were not lying to you. The functions genuinely worked in isolation. That is the trap. Line coverage measures whether your tests executed a line, not whether they cornered it. You can run every line in a file and assert nothing that matters about any of them, exactly like my integration test ran its loop and asserted the wrong thing. A green suite built on coverage tells you your tests touched the code. It tells you almost nothing about whether the code survives contact with production. 单元测试并没有对你撒谎。这些函数在隔离状态下确实能正常工作。这就是陷阱所在。代码覆盖率衡量的是测试是否执行了某行代码,而不是是否对其进行了充分的边界测试。你可以运行文件中的每一行代码,却不对其中任何关键逻辑进行断言,这正如我的集成测试运行了循环却断言了错误的结果一样。一个基于覆盖率的绿色测试套件只能告诉你:测试触碰了代码。它几乎无法告诉你代码在面对生产环境时是否依然稳健。
And autonomous systems, agents that route, retry, fall back, remember, do not fail in isolated functions. They fail in the seams between functions. They fail where two modules meet and disagree about a type. They fail on the input the author never pictured. They fail when two requests arrive at once. They fail when a dependency dies and the system panics instead of limping. They fail on the edge case nobody wrote down. Coverage walks straight past all five, because every one of those failures lives in territory a unit test is structurally built to avoid. 而自主系统——那些会路由、重试、回退、记忆的代理——它们不会在孤立的函数中失败。它们失败于函数之间的缝隙。它们失败于两个模块对接时对类型理解的不一致。它们失败于作者从未设想过的输入。它们失败于两个请求同时到达时。它们失败于依赖项挂掉而系统直接崩溃而非降级运行时。它们失败于无人记录的边缘情况。覆盖率指标对这五种失败视而不见,因为这些失败中的每一个都存在于单元测试在结构上刻意回避的领域。
Five seams, five suites
五个缝隙,五套测试
The shift that changed how I test agents was to stop asking “did my tests run the code” and start asking “what are the distinct ways this system actually breaks, and do I have a suite aimed at each one.” Five answers came back, and they are genuinely distinct failure classes, not five flavors of the same check. None of these dimensions is mine to claim as an invention, they are long-standing testing practice, integration testing, adversarial and fuzz testing, concurrency testing, fault injection, and property-based testing each have decades of prior art behind them. 改变我测试代理方式的转变在于:不再问“我的测试是否运行了代码”,而是问“这个系统实际上会以哪些不同的方式崩溃,我是否为每一种方式都准备了对应的测试套件”。我得到了五个答案,它们是真正不同的失败类别,而不是同一种检查的五种变体。这些维度并非我的发明,它们是长期存在的测试实践:集成测试、对抗性与模糊测试、并发测试、故障注入以及基于属性的测试,每一项都有数十年的先例可循。
The engineering distinctive is narrower and more honest, it is recognizing that an autonomous agent needs all five aimed at it at once, because it can fail in all five ways in a single week, and that a coverage number cannot stand in for any of them. 工程上的独特之处在于更狭窄、更诚实:即认识到自主代理需要同时针对这五个维度进行测试,因为它可能在一周内以这五种方式全部失败,而覆盖率数字无法替代其中任何一个。
The first seam is integration, where modules compose. The most common bug in a multi-module system is not “function X has wrong logic,” it is “X works fine but Y expected a different type,” or “A only works if B was set up first.” Mocks paper over exactly this, they return what you told them to and never enforce the real interface, which is how my same-string test slept through a real defect. 第一个缝隙是集成,即模块组合的地方。多模块系统中最常见的 Bug 不是“函数 X 逻辑错误”,而是“X 工作正常但 Y 期望不同的类型”,或者“A 只有在 B 先设置好的情况下才能工作”。Mock(模拟对象)掩盖了这一点,它们只返回你预设的值,从不强制执行真实的接口,这就是为什么我的“相同字符串”测试会漏掉一个真实的缺陷。
The second is adversarial input, the gap between the task you imagined and the task a real user sends, the hundred-thousand-character string, the embedded newline carrying a fake directive, the injection attempt, the empty string, the wall of emoji. The contract is not that nothing weird arrives. It is that weird input gets a safe answer or an honest error, never a crash and never a leak. 第二个是对抗性输入,即你设想的任务与真实用户发送的任务之间的差距:十万字符的字符串、带有虚假指令的嵌入式换行符、注入尝试、空字符串、满屏的 Emoji。契约不是“不会收到奇怪的东西”,而是“奇怪的输入能得到安全的响应或诚实的错误提示,绝不会崩溃,也绝不会泄露”。
The third is concurrency, the races that only appear when many requests hit shared state at once. A history list, a registry, a loop detector, anything two threads can write without a lock, will silently corrupt under load in a way no single-threaded test will ever reproduce. 第三个是并发,即只有当大量请求同时命中共享状态时才会出现的竞态条件。历史记录列表、注册表、循环检测器,任何两个线程可以在没有锁的情况下写入的东西,在高负载下都会以单线程测试永远无法复现的方式悄悄损坏。
The fourth is failure cascade, what happens when the pieces an agent depends on, the registry, the scorer, the loop detector, start dying. A naive build lets any one failure crash the whole call. A real one degrades, and the failure you actually have to test is all of them dying at once, because real outages are correlated and take down several things together. 第四个是故障级联,即当代理所依赖的组件(注册表、评分器、循环检测器)开始挂掉时会发生什么。幼稚的构建会让任何一个故障导致整个调用崩溃。成熟的系统会降级运行,而你真正需要测试的是它们同时挂掉的情况,因为真实的故障往往是关联的,会同时带走多个组件。
The fifth is property-based testing, where instead of writing examples you state an invariant and let a generator hunt thousands of inputs for the one that breaks it. The invariants that look obvious, “routing always returns a real capability or a clean error, nothing in between,” are exactly the ones a generated single-character task or a Unicode combining sequence quietly violates. 第五个是基于属性的测试,即不再编写具体的示例,而是声明一个不变量,让生成器去寻找数千个输入,找出那个能破坏它的输入。那些看起来显而易见的不变量,比如“路由总是返回一个真实的能力或一个清晰的错误,没有中间状态”,恰恰是生成的单字符任务或 Unicode 组合序列会悄悄破坏的。
What the checkmark should mean
“对勾”应该意味着什么
No single one of these dimensions catches everything, and that is the whole argument. Integration finds the type-contract and setup-order bugs and is blind to races. Adversarial finds the injection and the boundary crash and never sees a component failure. Concurrency finds the race and ignores the malformed input. 这些维度中没有哪一个能捕捉到所有问题,这正是本文的核心论点。集成测试能发现类型契约和设置顺序的 Bug,但对竞态条件视而不见;对抗性测试能发现注入和边界崩溃,却看不到组件故障;并发测试能发现竞态,却忽略了格式错误的输入。