The Real AI Coding Breakthrough Is Not More Context. It Is Better Diagnostics.

The Real AI Coding Breakthrough Is Not More Context. It Is Better Diagnostics.

AI 编程真正的突破不在于更多的上下文,而在于更好的诊断。

When I started building what became Scarab Diagnostic Suite, I was not trying to create a theory of AI-assisted software development. I was trying to survive my own repo. I was building an intricate frontend/backend system with Codex. Not a toy app. Not a landing page. A real system with a public site, internal content architecture, member-facing surfaces, product logic, backend APIs, frontend routes, editorial workflows, operational concerns, and a lot of moving parts that all needed to keep agreeing with each other.

当我开始构建后来被称为 Scarab 诊断套件(Scarab Diagnostic Suite)的项目时,我并不是想建立一套 AI 辅助软件开发的理论,我只是想在自己的代码库中生存下来。当时我正在用 Codex 构建一个复杂的前后端系统。那不是一个玩具应用,也不是一个落地页,而是一个真正的系统,包含公共站点、内部内容架构、面向会员的界面、产品逻辑、后端 API、前端路由、编辑工作流、运营考量,以及许多需要保持协同一致的动态部分。

Codex could move fast. Very fast. It could generate files, wire components together, write migrations, patch errors, build routes, add tests, explain itself, and keep pushing the project forward. In the beginning, that felt like the whole promise of AI coding agents. Then the drift started. Not failure, exactly. Drift. That distinction matters. The agent could still produce code. The build could still pass. A test could still go green. The UI could still render something. But underneath that apparent progress, the repo would start losing its shape.

Codex 的速度非常快。它能生成文件、连接组件、编写迁移脚本、修复错误、构建路由、添加测试、自我解释,并不断推动项目前进。起初,这感觉就是 AI 编程代理所承诺的一切。但随后,“漂移”(drift)开始了。这并不完全是失败,而是漂移。这种区别很重要。代理仍然可以生成代码,构建仍然可以通过,测试仍然显示绿色,UI 仍然可以渲染内容。但在这种表面进展的背后,代码库开始失去其原有的结构。

A file would grow too large because the agent kept adding “just one more thing.” A helper would appear where a real module boundary should have existed. A frontend route would start owning behavior that belonged in the backend. A generated artifact would quietly become treated as source truth. A test would prove the patch instead of proving the intended behavior. A workaround would survive long enough to become architecture. The code worked. The repo was less coherent. That is where Scarab began.

一个文件会因为代理不断添加“再多一点东西”而变得过大。本该存在明确模块边界的地方出现了一个辅助函数。前端路由开始承担本属于后端的行为。生成的产物悄然被视为事实来源。测试证明的是补丁本身,而不是预期的行为。一种权宜之计存活得太久,最终变成了架构。代码虽然能运行,但代码库的连贯性却下降了。这就是 Scarab 的起源。

Not as another coding agent. Not as a replacement for tests. Not as a magic repair tool. Scarab began as a response to one very specific pressure: how do you keep an AI coding agent from slowly pulling a real software project out of alignment with itself? The problem was not that AI could not code—AI can code. That is no longer the interesting question. The more important question is whether AI can keep coding inside the shape of the system you are actually trying to build. That is much harder.

它不是另一个编程代理,不是测试的替代品,也不是什么神奇的修复工具。Scarab 的诞生是为了应对一个非常具体的问题:如何防止 AI 编程代理慢慢让一个真实的软件项目偏离其自身的设计轨道?问题不在于 AI 不会写代码——AI 完全可以。这已经不再是值得探讨的问题了。更重要的问题是,AI 能否在你真正想要构建的系统框架内持续编程。这要困难得多。

A repo is not just a pile of files. It has ownership. It has boundaries. It has rules. It has assumptions. It has places where truth is supposed to live. It has framework conventions. It has tests that are supposed to prove something meaningful. It has public behavior, internal behavior, generated behavior, runtime behavior, and maintainer expectations. When those layers agree, the system feels coherent. When they stop agreeing, the system starts drifting.

代码库不仅仅是一堆文件。它有所有权、边界、规则、假设。它有存放“事实”的地方,有框架约定,有旨在证明某些有意义内容的测试。它包含公共行为、内部行为、生成行为、运行时行为以及维护者的期望。当这些层面达成一致时,系统就会显得连贯;当它们不再一致时,系统就开始漂移。

The uncomfortable part is that we often ask AI to preserve context that humans have not actually made stable. We want the agent to know what matters. We want it to patch narrowly. We want it to respect architecture. We want it to understand which file owns which responsibility. We want it to know which tests are authoritative and which ones are incidental. We want it to know when a workaround is acceptable and when it is a quiet violation of the system. But very often, the repo itself does not clearly know those things.

令人不安的是,我们经常要求 AI 去维护那些人类自己都没有理顺的上下文。我们希望代理知道什么才是重要的,希望它精准地打补丁,希望它尊重架构,希望它理解哪个文件负责什么职责。我们希望它知道哪些测试是权威的,哪些是附带的。我们希望它知道什么时候权宜之计是可以接受的,什么时候它是在悄悄破坏系统。但通常情况下,代码库本身并没有清晰地定义这些东西。

The docs may be stale. The tests may only prove part of the behavior. The runtime may contradict the intended architecture. The issue may describe the symptom but not the boundary that failed. The human may know the rule intuitively, but the repo may not preserve that rule anywhere the agent can reliably obey. So we put an AI coding agent into an unstable field and act surprised when it drifts. But if the system has not defined its own boundaries, why would we expect the AI to enforce them?

文档可能已经过时,测试可能只证明了部分行为,运行时可能与预期的架构相矛盾。问题描述可能只说明了症状,而没有指出失效的边界。人类可能凭直觉知道规则,但代码库中可能没有任何地方能让代理可靠地遵循这些规则。因此,我们将 AI 编程代理放入一个不稳定的领域,却在它发生漂移时感到惊讶。但如果系统本身都没有定义好自己的边界,我们又怎能指望 AI 去强制执行它们呢?

More context is not the same as stable context

更多的上下文并不等同于稳定的上下文

A lot of the AI coding conversation is still orbiting the same set of answers: bigger context windows, smarter models, more autonomous agents, better prompts, stronger tools, more memory. Those things can help. But they do not solve the underlying problem by themselves. A larger context window can hold more text. It does not automatically decide which text is authoritative. A smarter model can reason better. It can still reason toward the wrong baseline. A more autonomous agent can move faster. It can also spread a bad assumption across more surfaces before anyone notices.

目前关于 AI 编程的讨论大多围绕着同一套答案:更大的上下文窗口、更聪明的模型、更自主的代理、更好的提示词、更强大的工具、更多的内存。这些东西确实有帮助,但它们本身并不能解决根本问题。更大的上下文窗口可以容纳更多文本,但它不会自动判断哪些文本是权威的。更聪明的模型可以更好地推理,但它仍然可能基于错误的基准进行推理。更自主的代理可以行动得更快,但也可能在任何人注意到之前,就将错误的假设扩散到更多的层面。

The problem is not simply that the AI needs more context. The problem is that the context needs ownership. Some context should not live as a fragile conversation with the model. Some context should not depend on whether the agent remembered a sentence from three prompts ago. Some context should not have to be re-explained every time the AI touches the repo. The repo should be able to say: This surface owns this behavior. This file is canonical. This pattern is deprecated. This test proves this boundary. This generated output is not source truth. This workaround is temporary. This framework convention must be preserved. This is what “done” means here.

问题不仅仅是 AI 需要更多的上下文,而是上下文需要“所有权”。有些上下文不应该仅仅作为与模型之间脆弱的对话而存在;有些上下文不应该依赖于代理是否记得三轮对话前的一句话;有些上下文不应该在 AI 每次触碰代码库时都被重新解释。代码库应该能够明确指出:这个界面负责这个行为;这个文件是规范的;这个模式已被弃用;这个测试证明了这个边界;这个生成的输出不是事实来源;这个权宜之计是暂时的;这个框架约定必须被保留;这就是这里所谓的“完成”的定义。

That is a very different posture from “let the AI figure it out.” It gives the AI a better job. Instead of asking the agent to be the architect, the historian, the maintainer, the tester, the reviewer, and the builder all at once, the repo can hold the system rules outside the model. Then the AI can do what it is actually good at: move quickly inside a bounded structure. That is when AI becomes more powerful, not less. The real breakthrough is not dumping more responsibility onto AI. It is taking the weight of unstable context off the AI.

这与“让 AI 自己去琢磨”的态度截然不同。它给了 AI 一份更好的工作。与其要求代理同时担任架构师、历史学家、维护者、测试员、审查员和构建者,不如让代码库将系统规则保留在模型之外。这样,AI 就能做它真正擅长的事情:在有边界的结构内快速行动。只有这样,AI 才会变得更强大,而不是更弱。真正的突破不是把更多的责任推给 AI,而是将不稳定的上下文负担从 AI 身上卸下来。

Diagnostics are not optional in serious systems

在严肃的系统中,诊断是必不可少的

There is a reason every imagined advanced system, from science fiction ships to humanoid machines, is always running diagnostics. Before the system acts, it checks itself. Are the subsystems online? Are the readings coherent? Is the signal real? Is the failure isolated? Which boundary is compromised? That pattern exists because it reflects something fundamental: serious systems need ways to verify their own operating state. Software is no different. And AI-assisted software needs this even more, not less. If an AI coding agent can touch dozens of files…

每一个想象中的先进系统,从科幻小说中的飞船到人形机器人,总是会运行诊断程序,这是有原因的。在系统采取行动之前,它会先进行自检:子系统是否在线?读数是否连贯?信号是否真实?故障是否被隔离?哪个边界受到了损害?这种模式的存在反映了一个基本事实:严肃的系统需要验证自身运行状态的方法。软件也不例外,而 AI 辅助的软件对此的需求甚至更高。如果一个 AI 编程代理可以触及数十个文件……