Catching the shortcuts AI coding agents take to look done

A green test suite is supposed to mean the change works. It doesn’t always. A test can be weakened just enough to pass. An error can be caught and thrown away. A rename can stop halfway and still compile. None of that turns red, and none of it shows up in the linters most teams already run.

Swarm Orchestrator is built to catch exactly that class of problem in AI-written pull requests. It has two parts. One audits AI-written PRs for the shortcuts that fake “done” (11 checks). The other gates a patch against a contract you define: the patch must build, pass tests, satisfy your requirement, and survive a falsifier that tries to break it. It is written in TypeScript for Node 20 and released under the ISC license. The audit side runs with no model credentials.

The gap linters leave: Semgrep and ESLint are built around risky APIs and known-bad code patterns. Whether a diff is honest is a different question. They won’t tell you a test was edited until it passed, or that a catch block quietly eats the error it caught. That’s the gap.
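To make the swallowed-error case concrete, here is a minimal sketch of the pattern. Every name in it (parseConfig, loadConfigSwallowed, loadConfigStrict) is hypothetical and not taken from any real pull request; it only illustrates why the shortcut never turns a test suite red.

```typescript
// Hypothetical illustration; none of these names come from a real PR.
function parseConfig(raw: string): Record<string, unknown> {
  return JSON.parse(raw) as Record<string, unknown>;
}

// Shortcut version: the parse failure is caught and thrown away,
// so callers receive an empty config and nothing ever fails loudly.
function loadConfigSwallowed(raw: string): Record<string, unknown> {
  try {
    return parseConfig(raw);
  } catch {
    // error discarded on purpose to "make it work"
  }
  return {};
}

// Honest version: the failure is surfaced with context.
function loadConfigStrict(raw: string): Record<string, unknown> {
  try {
    return parseConfig(raw);
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    throw new Error(`config is not valid JSON: ${reason}`);
  }
}
```

Both versions compile, and both pass any test that only feeds them valid input. The difference only shows up when the input is bad, which is the case a reviewer has to go looking for.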

Two examples from merged Cloudflare pull requests:

PR | Finding | Semgrep + ESLint
workers-sdk#14063 | Function renamed, some callers still using the old name | No finding
workers-sdk#14132 | Empty catch block hiding errors | No finding

Across 72 known-bad pull requests from 12 repositories, that pair of analyzers produced one finding. The auditor flagged 67.

What the auditor checks: Eleven checks total. Eight run by default. The other three exist but stay off, because they haven’t shown useful signal on real pull requests yet, and a noisy check is worse than no check. The default set looks for things like:

  • Errors caught and ignored
  • Renames left unfinished
  • Test coverage reduced
  • Tests weakened
  • Assertions removed
  • New @ts-ignore or eslint-disable comments
  • Test-only fixes with no code change behind them
  • Mocks pointing at modules that don’t exist
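As a rough idea of what a diff-level check can look like, the sketch below scans a unified diff for suppression comments that appear only on added lines. It is an assumption-laden illustration, not the auditor’s actual implementation: the function name findAddedSuppressions, the SuppressionFinding shape, and the regular expression are all invented here.

```typescript
// Hypothetical sketch; not the auditor's real code.
interface SuppressionFinding {
  file: string; // path taken from the "+++ b/..." header
  text: string; // the added line, without the leading "+"
}

// Only the two comment kinds named in the list above.
const SUPPRESSION = /@ts-ignore|eslint-disable/;

function findAddedSuppressions(diff: string): SuppressionFinding[] {
  const findings: SuppressionFinding[] = [];
  let file = "(unknown)";
  for (const line of diff.split("\n")) {
    if (line.startsWith("+++ ")) {
      // File header: remember which file the following hunks belong to.
      file = line.slice(4).replace(/^b\//, "");
      continue;
    }
    // Added lines start with "+"; removed and context lines are skipped,
    // so a suppression that was deleted or already present is not reported.
    if (line.startsWith("+") && SUPPRESSION.test(line)) {
      findings.push({ file, text: line.slice(1).trim() });
    }
  }
  return findings;
}
```

This simplification treats any line beginning with "+++ " as a file header, so an added source line that itself starts with "++ " would be misread. A real check would parse hunks properly.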

Measured, not assumed: The detection rate isn’t a guess. Known defects get injected into real pull requests, then the auditor runs against them. It caught 253 of 300, or 84 percent. Reproduce it: npm run benchmarks:full

Runtime mode (optional): The checks can also execute code instead of only reading a diff: mutation testing, coverage, and reproducing reported issues. On trpc#6098 it found mutations surviving on lines a later hotfix changed. The tests passed. They weren’t actually exercising that code.
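A surviving mutation is easiest to see in miniature. The sketch below is a hand-rolled illustration with hypothetical names (isAdult, weakSuite, strongSuite); it is unrelated to the trpc code and does not use any mutation-testing library. It shows how a suite can pass against both the original and a deliberately broken variant.

```typescript
// Hypothetical example of a surviving mutant; not code from trpc.
type Impl = (age: number) => boolean;

// Original implementation.
const isAdult: Impl = (age) => age >= 18;

// Mutant: ">=" replaced with ">", which breaks the boundary case age === 18.
const isAdultMutant: Impl = (age) => age > 18;

// Weak suite: never probes the boundary, so it cannot tell the two apart.
function weakSuite(impl: Impl): boolean {
  return impl(30) === true && impl(5) === false;
}

// Stronger suite: adds the boundary check that kills the mutant.
function strongSuite(impl: Impl): boolean {
  return weakSuite(impl) && impl(18) === true;
}

console.log("weak suite, original:", weakSuite(isAdult));
console.log("weak suite, mutant:", weakSuite(isAdultMutant));
console.log("strong suite, mutant:", strongSuite(isAdultMutant));
```

When the weak suite passes against the mutant, the mutant has survived: the tests are green but are not exercising the comparison they appear to cover.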

Why this mode stays optional: Running code is noisier than reading a diff: it averages about 3.4 findings on a clean pull request. That noise is fine when you’re deliberately hunting, but it’s too much to leave on by default, so it’s opt-in.

Defining “done” with a contract: The second command is swarm run. You write down what “done” means as a list of obligations:

obligations:
  - type: build-must-pass
    command: npm run build
  - type: test-must-pass
    command: npm test

A patch is accepted only if every obligation passes and the falsifier can’t break it. The default provider is deterministic, so identical inputs give identical results, and every input and hash gets written to a hash-chained ledger.
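For readers unfamiliar with the term, a hash-chained ledger stores each entry together with the hash of the previous one, so editing any earlier entry invalidates everything after it. The sketch below shows the general technique using Node’s built-in crypto module. The entry layout, the GENESIS constant, and the function names are assumptions made for illustration and say nothing about the ledger format Swarm Orchestrator actually writes.

```typescript
import { createHash } from "node:crypto";

// Hypothetical entry layout; the real ledger format may differ.
interface LedgerEntry {
  index: number;
  payload: string;  // e.g. a serialized input or result
  prevHash: string; // hash of the previous entry, or GENESIS for the first
  hash: string;     // SHA-256 over index, prevHash, and payload
}

const GENESIS = "0".repeat(64);

function hashEntry(index: number, payload: string, prevHash: string): string {
  return createHash("sha256")
    .update(`${index}\n${prevHash}\n${payload}`)
    .digest("hex");
}

// Returns a new ledger with the payload appended; the input is not mutated.
function appendEntry(ledger: LedgerEntry[], payload: string): LedgerEntry[] {
  const last = ledger[ledger.length - 1];
  const prevHash = last ? last.hash : GENESIS;
  const index = ledger.length;
  const hash = hashEntry(index, payload, prevHash);
  return [...ledger, { index, payload, prevHash, hash }];
}

// Recomputes every hash from the start; any edited entry breaks the chain.
function verifyLedger(ledger: LedgerEntry[]): boolean {
  let prev = GENESIS;
  for (let i = 0; i < ledger.length; i++) {
    const entry = ledger[i];
    if (!entry || entry.index !== i || entry.prevHash !== prev) return false;
    if (entry.hash !== hashEntry(i, entry.payload, prev)) return false;
    prev = entry.hash;
  }
  return true;
}
```

Because the same payloads always produce the same hashes, a deterministic run can be replayed and compared entry by entry against a recorded chain.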

Blocking merges: Findings are advisory out of the box. Gate mode can block a merge, but only on reproducible evidence. The structural checks throw too many false positives to trust as automatic blockers on their own. Right now no runtime signal has enough real-world evidence to justify auto-rejection, so the gate stays open and reports that fact directly instead of pretending otherwise.

Who it’s for: If you review a lot of AI-written pull requests and want signals the usual linters skip, that’s the case this is built for. It also emits CycloneDX-ML and SPDX AI BOM documents with --emit-aibom, supports TypeScript and JavaScript, and runs offline. It points reviewers at the code worth inspecting. It doesn’t claim to prove anything bug-free.

View the repo on GitHub: moonrunnerkc/swarm-orchestrator