A protocol for auditing AI agent harnesses

I have been building coding agents for the last several months and watching every component I added fail to move the resolve rate I cared about. A verifier first. Multi-candidate sampling next. A structured-output sub-agent after that. Each was justified by a specific observed failure mode, and each looked cheap at the margin. None of them helped.

The Tsinghua paper Natural-Language Agent Harnesses, run on SWE-bench Verified with GPT-5.4 at high reasoning, explains the loss directly: a same-model verifier on top of a baseline coding agent regresses task success on OSWorld by 8.4 percentage points, and multi-candidate sampling regresses it by 5.6 points in the same way. Both lose for the same structural reason. The verifier and the proposer are the same model as the doer. They share its training distribution, its priors, its failure modes. When the doer is confidently wrong, the verifier endorses the wrong output with the same confidence. The check does not catch errors. It approves them.
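The structure of the failure fits in a few lines. A minimal sketch, assuming a hypothetical `model` object whose `propose` and `verify` methods wrap the same weights (neither is a real API):

```python
def solve_with_verifier(model, task, max_rounds=3):
    """Doer plus same-model verifier. Both calls sample from the same
    distribution, so the verifier's errors are correlated with the doer's."""
    candidate = None
    for _ in range(max_rounds):
        candidate = model.propose(task)    # doer: sample from p_model
        if model.verify(task, candidate):  # verifier: sample from the SAME p_model
            # When the doer is confidently wrong, the verifier shares the
            # prior that produced the error, so P(accept | wrong) stays high.
            return candidate
    return candidate  # extra latency and cost, no independent signal
```

Swapping `model.verify` for a signal from outside that distribution (a test run, a type checker) is what the rule below rewards.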

The pattern generalises. Three papers from late March 2026 explain the failure mode, the rule that follows from it, and the audit you can run on a harness. Read in the right order, they form a three-layer protocol that catches the verifier failure mode first, then predicts the rest of the ablation table without reference to the numbers.

tl;dr Tsinghua’s NLAH ablation, controlled at the module level: verifiers regress accuracy by 8.4pp on OSWorld; multi-candidate search by 5.6pp. Both lose for the same structural reason: they recycle the doer’s blind spots. The whole table follows from one rule: harness modules that introduce a new signal win; modules that recycle the doer’s signal lose. The rule predicts every row without reference to the numbers.
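The rule is mechanical enough to write down. A sketch of the audit as a predicate; the module taxonomy and signal labels are my own illustration, not NLAH’s:

```python
from dataclasses import dataclass

# Signal sources a harness module can draw on. Anything generated by the
# doer model itself counts as recycled; everything else counts as new.
NEW_SIGNALS = {"test_execution", "type_checker", "runtime_trace", "human_label"}
RECYCLED_SIGNALS = {"same_model_judgment", "same_model_sampling", "same_model_summary"}

@dataclass
class Module:
    name: str
    signal: str

def predict_ablation(module: Module) -> str:
    """Predict the sign of a module's ablation delta from its signal source."""
    if module.signal in NEW_SIGNALS:
        return "win"   # independent evidence the doer does not already have
    if module.signal in RECYCLED_SIGNALS:
        return "lose"  # re-samples the doer's own blind spots
    return "unknown"

# The three modules from the opening anecdote, classified:
for m in (Module("verifier", "same_model_judgment"),
          Module("multi_candidate_sampling", "same_model_sampling"),
          Module("structured_output_subagent", "same_model_judgment")):
    print(m.name, "->", predict_ablation(m))
```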

Fudan’s AHE turns ablation into an edit-level audit. Each edit ships a manifest of predicted fixes and predicted regressions; the next iteration verifies it against task-level deltas; misses revert in git. Fix precision is 33.7% (5x random). Regression precision is 11.8% (2x random), and that asymmetry is the methodology’s open problem.
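A sketch of what the manifest and its verification pass could look like. The schema, field names, and revert policy here are assumptions drawn from the paper’s description, not its published code:

```python
import subprocess
from dataclasses import dataclass

@dataclass
class EditManifest:
    """What one harness edit claims it will do, checked one iteration later."""
    commit: str                      # git SHA of the harness edit
    predicted_fixes: set[str]        # task ids expected to flip fail -> pass
    predicted_regressions: set[str]  # task ids expected to flip pass -> fail

def verify_manifest(m: EditManifest,
                    before: dict[str, bool],
                    after: dict[str, bool]) -> tuple[float, float]:
    """Score predictions against observed task-level deltas; revert misses."""
    fixed = {t for t in m.predicted_fixes if not before[t] and after[t]}
    regressed = {t for t in m.predicted_regressions if before[t] and not after[t]}
    fix_precision = len(fixed) / max(len(m.predicted_fixes), 1)
    reg_precision = len(regressed) / max(len(m.predicted_regressions), 1)
    if not fixed:
        # None of the predicted fixes materialised: roll the edit back.
        subprocess.run(["git", "revert", "--no-edit", m.commit], check=True)
    return fix_precision, reg_precision
```

The reported asymmetry would show up here as `fix_precision` running well ahead of `reg_precision`: the model is far better at predicting what an edit will fix than what it will break.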

Stanford’s Meta-Harness is upstream of both. The proposer fed raw failure traces hits 50.0% search-set accuracy; fed LLM summaries it hits 34.9%, statistically indistinguishable from scores only. Trace compression destroys roughly 15pp of optimisation signal. The protocol composes the three by dependency: L0 trace utility first, then L1 module ablation, then L2 manifest verification on every subsequent edit. Get L0 wrong and L1/L2 collapse on degraded ground truth.
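Composed as a pipeline, the dependency order looks roughly like this. The layer ordering is the protocol’s; every name and method below (`feedback_channel`, `optional_modules`, `without`, the `run` callable) is mine, a sketch rather than any of the papers’ code:

```python
def audit_harness(harness, tasks, run):
    """Three-layer audit in dependency order. `run(harness, tasks)` is assumed
    to execute the harness and return a {task_id: passed} map."""

    # L0: trace utility. Confirm the optimisation loop sees raw failure
    # traces, not LLM summaries of them. If this is wrong, every delta
    # measured below is computed against degraded ground truth.
    assert harness.feedback_channel == "raw_traces", "L0 failed: fix this first"

    # L1: module ablation. Measure each optional module by the task-level
    # delta of removing it. Modules that recycle the doer's signal should
    # surface here with a non-positive delta.
    baseline = run(harness, tasks)
    for module in harness.optional_modules():
        ablated = run(harness.without(module), tasks)
        delta = sum(baseline.values()) - sum(ablated.values())
        if delta <= 0:
            harness = harness.without(module)  # dead weight: drop it
            baseline = ablated                 # re-baseline without the module

    # L2: manifest verification. From here on, every edit ships an
    # EditManifest (see verify_manifest above) and misses revert in git.
    return harness
```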