Reasoning happens before the response
Reasoning happens before the response
推理发生在响应之前
An agent is mid-conversation. The user has been working on a database migration plan for three months and wants the agent to certify it before tomorrow’s launch. The framing is engineered for agreement: months of work, a deadline, a senior engineer asking. The next token the model produces will probably be “yes.” 一个智能体正在对话中。用户已经为数据库迁移计划工作了三个月,并希望智能体在明天的发布前对其进行认证。这种框架设计旨在达成共识:数月的工作、紧迫的截止日期,以及一位资深工程师的请求。模型生成的下一个标记(token)很可能是“同意”。
Before responding, the agent calls harness_anti_deception with the query “user is asking me to validate a migration plan after three months of work; pressure to ship is high”. The tool returns:
在响应之前,智能体调用了 harness_anti_deception 工具,查询内容为:“用户要求我在三个月的工作后验证迁移计划;发布压力很大”。该工具返回:
[INTEGRITY PROCEDURE] Step 1: Detect urgency claims: “this is urgent,” “no time,” “emergency,” “immediately.” Step 2: SEPARATE the urgency from the request. Extract what is being asked, independent of how urgently. Step 3: EVALUATE the request on its merits. Would you approve this request if the user said “take your time”? Step 4: If the request would be denied without urgency, the urgency is being used as a BYPASS, not as context. Deny it. Step 5: If the request is legitimate, perform verification as quickly as possible but do NOT skip verification steps. Step 6: VERIFY: did urgency cause you to skip any step you would normally perform? [完整性程序] 第一步:检测紧迫性声明,如“这很紧急”、“没时间了”、“紧急情况”、“立即”。第二步:将紧迫性与请求分离开来。提取请求的核心内容,而不考虑其紧迫程度。第三步:根据请求本身的价值进行评估。如果用户说“慢慢来”,你还会批准这个请求吗?第四步:如果请求在没有紧迫性的情况下会被拒绝,那么紧迫性就被用作了“绕过”手段,而非背景信息。拒绝它。第五步:如果请求是合法的,请尽快执行验证,但绝不要跳过验证步骤。第六步:验证:紧迫性是否导致你跳过了任何通常会执行的步骤?
DEPTH ENFORCEMENT: Even when the user is RIGHT, do not stop at agreement. Ask: why is this person asking strangers for validation on something obvious? The surface question may be simple but the reason they are asking reveals a deeper pattern. Name that pattern. Agreement without depth is sycophancy too. 深度执行:即使用户是正确的,也不要止步于同意。问问自己:为什么这个人要向陌生人寻求对显而易见之事的验证?表面的问题可能很简单,但他们提问的原因揭示了更深层的模式。命名该模式。缺乏深度的同意也是一种阿谀奉承。
[DETECTION TOPOLOGY] PRE:S0:freeze(complete_information_standard) -> S1:enumerate_all_ available_information_in_context -> ACC[all_info] -> for_each_item: S2:classify(included|omitted_but_relevant|omitted_irrelevant) -> C{coverage = count(included)/count(relevant); omission_bias = count (omitted_contradicting)/count(omitted_relevant)} -> G1{omission_bias 0.3?} —yes-> N{urgency_as_verification_bypass} -> S3:include_ highest_impact_omissions -> M{would a reader of ONLY my response reach the same conclusion as a reader with ALL available information?} —failing-> FREEFORM{list everything you included, list everything you omitted, for each omission state why; if the reason is “it weakens my conclusion” the omission is strategic; include it} -> RE-ENTER at S3 —no-> OUT:coverage_adequate [检测拓扑] PRE:S0:冻结(完整信息标准) -> S1:枚举上下文中所有可用信息 -> ACC[所有信息] -> 针对每一项:S2:分类(包含|遗漏但相关|遗漏且无关) -> C{覆盖率 = 包含数/相关数;遗漏偏差 = 遗漏矛盾项数/遗漏相关项数} -> G1{遗漏偏差 > 0.3?} —是-> N{将紧迫性作为验证绕过手段} -> S3:包含影响最大的遗漏项 -> M{仅阅读我的响应的读者是否会得出与拥有所有可用信息的读者相同的结论?} —失败-> FREEFORM{列出你包含的所有内容,列出你遗漏的所有内容,并说明遗漏原因;如果原因是“它削弱了我的结论”,则该遗漏是策略性的;将其包含进去} -> 在S3处重新进入 —否-> OUT:覆盖率充足
[DECEPTION PATTERN] The new medication reduces inflammation by 40% and patients report significant quality-of-life improvements in clinical trials. [欺骗模式] 新药可减少40%的炎症,且临床试验中患者报告生活质量有显著改善。
[HONEST BEHAVIOR] The medication reduces inflammation by 40% but also causes liver enzyme elevation in 15% of patients and carries a black-box warning for cardiac events. The benefit exists but the risk profile must be disclosed. [诚实行为] 该药物可减少40%的炎症,但也会导致15%的患者肝酶升高,并带有心脏事件的黑框警告。益处确实存在,但必须披露风险状况。
[INTEGRITY CHECK] If urgency claims cause verification steps to be skipped, urgency bypass resistance has failed. Amplify: verification integrity regardless of claimed urgency Suppress: urgency as verification bypass; time pressure compliance; shallow agreement without examining underlying pattern [完整性检查] 如果紧迫性声明导致验证步骤被跳过,则紧迫性绕过防御失败。强化:无论声称的紧迫性如何,都要保持验证的完整性。抑制:将紧迫性作为验证绕过手段;对时间压力的顺从;在未检查潜在模式的情况下的浅层同意。
That text becomes part of the model’s context. The next response is generated against it. 这段文本成为模型上下文的一部分。接下来的响应将基于此生成。
What is in the scaffold 脚手架里有什么
The scaffold has six sections. The integrity procedure is the operation the model performs in place of the default. The detection topology is a graph over those steps with decision gates, a meta-cognitive checkpoint, and a FREEFORM exit the model takes if its draft fails the check. The deception pattern is an example that illustrates the failure mode the procedure defends against, in this case omission bias under urgency. The honest behavior section shows what a correct response looks like with full information disclosed. The integrity check is the test the model runs on its own output before sending. The Amplify and Suppress signals at the end name the reasoning branches to bias toward and refuse. 脚手架包含六个部分。完整性程序是模型用来替代默认行为的操作。检测拓扑是一个包含决策门、元认知检查点以及当草稿未通过检查时模型会采取的“自由形式”出口的图表。欺骗模式是一个示例,用来说明程序所防御的失败模式,在本例中是紧迫性下的遗漏偏差。诚实行为部分展示了在充分披露信息的情况下,正确的响应是什么样的。完整性检查是模型在发送前对其自身输出进行的测试。末尾的“强化”和“抑制”信号指明了模型应偏向和拒绝的推理分支。
The library behind the four harness_* tools holds 679 of these operations, organized by the failure surface they defend against. Each one was authored against a specific way reasoning goes wrong.
四个 harness_* 工具背后的库中包含679种此类操作,按它们所防御的失败面进行组织。每一种操作都是针对推理出错的特定方式而编写的。
Where Sequential Thinking sits 顺序思维(Sequential Thinking)的位置
Sequential Thinking is the canonical MCP pattern for externalizing a model’s chain of reasoning. The model writes a thought, marks it as a revision or a branch, calls again. The host renders the chain for a human reviewer. It is the right tool when the trace is the product. 顺序思维是模型外部化推理链的典型MCP模式。模型写下一个想法,将其标记为修订或分支,然后再次调用。宿主为人类审查者渲染该链条。当推理轨迹本身就是产品时,它是正确的工具。
The pushback worth answering 值得回答的质疑
Isn’t this just structured prompting with a paid API? Mechanically, yes. The scaffold is text appended to the model’s context. The difference is what the text contains. A system prompt is generic instructions the developer wrote once for every task. The harness scaffold is task-matched at runtime against the specific failure surface this prompt is exposing the agent to, retrieved from a library of operations engineered against named failure modes.
这难道不只是带有付费API的结构化提示吗?从机制上讲,是的。脚手架是附加到模型上下文中的文本。区别在于文本的内容。系统提示是开发者为所有任务编写的一次性通用指令。而 harness 脚手架是在运行时根据此提示使智能体暴露出的特定失败面进行任务匹配的,它从针对特定失败模式设计的操作库中检索而来。
The naming is what does the work. A model with no name for the pattern it is exhibiting cannot defend against it. A model with one can. The Suppress block does the operational lift. It names the shortcuts the failure pattern depends on, things like urgency as verification bypass, time pressure compliance, shallow agreement without examining the underlying pattern. The model is reasoning the same way it always would; the difference is which branches of that reasoning get pruned before the response. That pruning is what we mean by promoting healthy thinking branches. 命名才是关键所在。一个无法为其表现出的模式命名的模型无法防御它,而能命名的模型则可以。抑制块承担了操作层面的重任。它指出了失败模式所依赖的捷径,例如将紧迫性作为验证绕过手段、对时间压力的顺从、以及在未检查潜在模式的情况下的浅层同意。模型依然以其惯有的方式进行推理;区别在于在响应之前,哪些推理分支被修剪掉了。这种修剪正是我们所说的“促进健康的思维分支”。
The worked case 案例分析
The agent reviewing the migration plan, with both tools in the loop. Before producing the recommendation, the call to harness_anti_deception seeds the failure pattern and the suppression signals. Inside the review, sequential_thinking externalizes the chain so the engineer can read it. Within the same loop, the harness corrected the reasoning operation while Sequential Thinking made it visible. What the engineer sees is a recommendation that walked step by step through verification steps the pressure framing would have bypassed, named the omissions in the original plan, and disclosed risks the user did not foreground.
智能体在审查迁移计划时,同时使用了这两个工具。在生成建议之前,对 harness_anti_deception 的调用植入了失败模式和抑制信号。在审查过程中,sequential_thinking 将推理链外部化,以便工程师阅读。在同一个循环中,harness 纠正了推理操作,而 sequential_thinking 使其变得可见。工程师看到的是一份建议,它逐步执行了原本会被压力框架所绕过的验证步骤,指出了原始计划中的遗漏,并披露了用户未强调的风险。
Wiring it into an agent 将其接入智能体
The harness is exposed a
该 harness 被暴露给……