Design Loops, Not Prompts

Agentic AI: design loops, not prompts, but don’t let the model check itself

“We don’t write prompts anymore. We design loops.” (someone at Anthropic, June 2026)

In a self-correcting agent loop, self-critique did no better than doing nothing. A deterministic, source-anchored verifier cut the hallucination rate roughly in half.

The line is from a few weeks ago and already feels true. We stopped tuning one perfect prompt and started building systems that try, check their own work, and improve over several steps. A model that can revise is worth more than a model that answers once and stops. On this, the line is right.

What it leaves out is the bill. A loop is far harder to verify than a single call: with one call you check one output, but in a loop every step can drift, and the ways it can go wrong multiply with each iteration. The hard part stops being generation and becomes verification: knowing whether the loop is getting it right. And the default way to verify, letting the model check its own work, turns out to be the weakest link in the chain.

So this is not a quarrel with “design loops, not prompts.” It is the catch that line hides, measured: the experiment that convinced me, with the numbers and the method, so you can check it yourself.

The verification surface grows with every step

A single call has one place to be wrong: the answer. A three-step loop has the first draft, the critique of the draft, the revision, the critique of the revision, and the decision to stop. Each of those is a model output, and each can be confidently wrong. You did not remove the verification problem by adding a loop. You multiplied it.

The loop acts on its own verdicts. If the check says “good,” the loop stops and ships. If the check is wrong, the loop ships a mistake. Worse, it may keep polishing that mistake across iterations until it reads convincingly. A loop is only as trustworthy as the thing it verifies against.
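To make the dependency on the verdict concrete, here is a minimal sketch of a revise-until-passed loop. The names `run_loop`, `draft`, `revise` and `verify` are hypothetical stand-ins invented for illustration, not anything from the experiment; the toy functions simulate a generator that is confidently wrong on its first try.

```python
from typing import Callable


def run_loop(
    question: str,
    draft: Callable[[str], str],
    revise: Callable[[str, str], str],
    verify: Callable[[str, str], bool],
    max_iters: int = 3,
) -> tuple[str, int]:
    """Return (final answer, number of revisions). Stops as soon as verify() passes."""
    answer = draft(question)
    for i in range(max_iters):
        if verify(question, answer):
            return answer, i  # the loop trusts the verdict and ships
        answer = revise(question, answer)
    return answer, max_iters


# Toy stand-ins (assumptions for illustration only).
def toy_draft(q: str) -> str:
    return "Sydney"  # fluent, confident, wrong


def toy_revise(q: str, a: str) -> str:
    return "Canberra"


def lenient_check(q: str, a: str) -> bool:
    return True  # "sounds right", so everything passes


def strict_check(q: str, a: str) -> bool:
    return a == "Canberra"  # stands in for an external source of truth


shipped_lenient = run_loop("Capital of Australia?", toy_draft, toy_revise, lenient_check)
shipped_strict = run_loop("Capital of Australia?", toy_draft, toy_revise, strict_check)
print(shipped_lenient)  # ('Sydney', 0)
print(shipped_strict)   # ('Canberra', 1)
```

The generator and the reviser are identical in both runs. Only the check differs, and it alone decides whether the mistake ships.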

The weakest link: the model grading itself

The most common verifier is the model itself. After drafting, you ask it: “Is this correct?” It is cheap, it needs no extra infrastructure, and it feels like reflection. The problem is what the model optimizes for. When an LLM grades its own output, it rewards answers that sound right. A confident, fluent, wrong answer sounds right too. So self-critique tends to wave through exactly the failures you most want to catch, and occasionally it talks itself out of a correct answer. There is no external truth in the loop, only the same distribution that produced the error, now asked to detect it. I wanted to measure it.

A different kind of check: deterministic and source-anchored

The alternative is a verifier that does not ask the model’s opinion at all. Two properties matter:

  • Source-anchored. The check measures whether an answer is grounded in a real source, not whether it reads well. If the answer drifts away from the source material, the verifier flags it, however confident the prose sounds.
  • Deterministic. Same input, same verdict, every time. You can inspect it, log it, and trust it across runs. A stochastic judge that changes its mind is not a foundation a loop can stand on.

The verifier I used is geometric. It embeds the question, the candidate answer, and the source on a vector hypersphere and reads the angles between them. A grounded answer sits close to its source; a hallucinated one drifts toward the question and away from the source. The Semantic Grounding Index (SGI) is a ratio of two such angles; a companion score (DGI) is a distributional grounding measure calibrated on held-out grounded pairs. Both are pure geometry over a fixed encoder, so they are deterministic by construction.
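The angle-ratio idea can be sketched in a few lines. This is an illustration under stated assumptions, not the Groundlens implementation: the function name `sgi_like` is hypothetical, the hand-made 3-d vectors stand in for real encoder embeddings, and the direction of the ratio (answer-to-question angle over answer-to-source angle, so that higher means more grounded) is my assumption, since the text only says SGI is a ratio of two such angles.

```python
import numpy as np


def angle(u: np.ndarray, v: np.ndarray) -> float:
    """Angle in radians between two vectors after projecting them onto the unit sphere."""
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    return float(np.arccos(np.clip(np.dot(u, v), -1.0, 1.0)))


def sgi_like(question: np.ndarray, answer: np.ndarray, source: np.ndarray,
             eps: float = 1e-9) -> float:
    """Assumed ratio: angle(answer, question) / angle(answer, source).

    Values above 1 mean the answer sits closer to the source than to the question.
    """
    return angle(answer, question) / (angle(answer, source) + eps)


# Toy embeddings (assumptions): question and source point in different directions.
q = np.array([1.0, 0.0, 0.0])
s = np.array([0.0, 1.0, 0.0])
grounded = np.array([0.2, 1.0, 0.0])      # close to the source
hallucinated = np.array([1.0, 0.2, 0.0])  # drifts toward the question

score_grounded = sgi_like(q, grounded, s)
score_hallucinated = sgi_like(q, hallucinated, s)
print(score_grounded > 1.0 > score_hallucinated)  # True
```

Nothing in the computation is sampled, so for fixed embeddings the score is the same on every call, which is the deterministic property described above.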

The implementation is open source (Groundlens); the point of this article is not the math but what happens when you put such a check inside a loop. First, does the geometry even discriminate hallucinations? On the HaluEval QA benchmark, scoring grounded against hallucinated answers:

Verifier signal    AUROC    95% CI
SGI                0.769    [0.715, 0.821]
DGI                0.939    [0.911, 0.964]
SGI + DGI          0.949    [0.926, 0.971]

Table 1: Detection on n = 300 answer pairs; bootstrap confidence intervals.

The combined signal separates grounded from hallucinated answers cleanly. That is the precondition. Now the question is whether a check this accurate, placed inside a loop, actually makes the loop’s final answers better than self-critique does.
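For readers who want to reproduce this kind of table, here is a minimal sketch of AUROC with a percentile bootstrap interval. It assumes paired scores (one grounded and one hallucinated answer per question) and resamples pairs; the article does not say which bootstrap variant was used, so treat that as an assumption. The scores below are synthetic and only exercise the code; they are not the experiment’s data.

```python
import numpy as np


def auroc(pos_scores, neg_scores) -> float:
    """Probability that a grounded answer outscores a hallucinated one (ties count half)."""
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    neg = np.asarray(neg_scores, dtype=float)[None, :]
    return float((pos > neg).mean() + 0.5 * (pos == neg).mean())


def bootstrap_ci(pos_scores, neg_scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap over answer pairs; both arrays must have equal length."""
    pos = np.asarray(pos_scores, dtype=float)
    neg = np.asarray(neg_scores, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(pos)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample pairs with replacement
        stats.append(auroc(pos[idx], neg[idx]))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)


# Synthetic scores (assumption): grounded answers score higher on average.
rng = np.random.default_rng(0)
grounded_scores = rng.normal(1.0, 1.0, size=300)
hallucinated_scores = rng.normal(0.0, 1.0, size=300)

point = auroc(grounded_scores, hallucinated_scores)
ci_lo, ci_hi = bootstrap_ci(grounded_scores, hallucinated_scores)
print(f"AUROC {point:.3f}, 95% CI [{ci_lo:.3f}, {ci_hi:.3f}]")
```

Fixing the seed makes the interval itself reproducible, which keeps the evaluation as deterministic as the verifier it measures.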

The experiment

The design isolates one variable: what the loop verifies against (Figure 1).

Figure 1. Experiment setup

A generator answers factual questions closed-book (from its own memory, with no source in front of it), so it hallucinates often and a verifier has something to fix. Each question runs through four arms, and a cross-model referee grades every final answer, so no model judges itself in the scoring:

  1. Open-book reference: the generator is simply handed the source. No check. This is the ceiling.
  2. Single (closed-book): one answer, no check. This is the floor.
  3. Self-critique: closed-book; the model judges its own answer and revises until it is satisfied (up to three iterations).
  4. Source-anchored: closed-book; the geometric verifier scores the answer, and on a flag it injects the source and asks for a grounded rewrite (up to three iterations).
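Arms 3 and 4 can be sketched side by side as below. Everything here is a hypothetical stand-in: `generate`, `judge` and `score` are placeholders for the model call, the self-critique prompt and the geometric verifier; the prompt wording and the threshold are invented; and the toy word-overlap score is not the SGI/DGI computation.

```python
from typing import Callable


def self_critique_arm(question: str, generate: Callable[[str], str],
                      judge: Callable[[str, str], bool], max_iters: int = 3) -> str:
    """Arm 3: the same model decides whether its own answer is good enough."""
    answer = generate(question)
    for _ in range(max_iters):
        if judge(question, answer):
            break
        answer = generate(f"{question}\n\nPrevious answer: {answer}\nRevise it.")
    return answer


def source_anchored_arm(question: str, source: str, generate: Callable[[str], str],
                        score: Callable[[str, str, str], float],
                        threshold: float, max_iters: int = 3):
    """Arm 4: an external score flags the draft; on a flag the source is injected."""
    answer = generate(question)  # closed-book first draft
    trace = []
    for _ in range(max_iters):
        s = score(question, answer, source)
        trace.append(s)
        if s >= threshold:
            break
        answer = generate(
            f"{question}\n\nSource:\n{source}\n\nRewrite the answer using only the source."
        )
    return answer, trace


# Toy stand-ins (assumptions for illustration only).
SOURCE = "Paris is the capital of France."


def toy_generate(prompt: str) -> str:
    if "Source:" in prompt:  # with the source in view, the toy model copies it
        return prompt.split("Source:\n")[1].split("\n\n")[0]
    return "Lyon"  # closed-book hallucination


def toy_judge(question: str, answer: str) -> bool:
    return True  # the model finds its own fluent answer convincing


def toy_score(question: str, answer: str, source: str) -> float:
    a, b = set(answer.lower().split()), set(source.lower().split())
    return len(a & b) / len(a | b)  # word overlap, a crude proxy for grounding


self_answer = self_critique_arm("What is the capital of France?", toy_generate, toy_judge)
anchored_answer, trace = source_anchored_arm(
    "What is the capital of France?", SOURCE, toy_generate, toy_score, threshold=0.8
)
print(self_answer)             # Lyon
print(anchored_answer, trace)  # Paris is the capital of France. [0.0, 1.0]
```

The two arms share the generator and the iteration cap. The only structural difference is where the verdict comes from and what a failed verdict feeds back into the next draft.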

Setup, for reproduction: generator Claude Opus 4.8; referee GPT-5.5 (cross-model grading); benchmark HaluEval QA; encoder all-MiniLM-L6-v2; temperature=0 (if available); seed=0; loop thresholds calibrated on the model’s own closed-book training drafts; n=120 items through the loops.

One asymmetry is deliberate, and it is the whole point: the source-anchored arm has access to a source of truth through its verifier, and the self-critique arm does not. The hypothesis under test is not “geometry beats self-critique with the same information.” It is “a source-anchored verifier turns a hallucinating closed-book generator into a gr…”