How to align coding agents with your plans better than markdown, without burning tokens
The expensive moments in a coding-agent session are not the model’s tokens. They are the seconds you spend skimming a markdown plan and missing a subtle misalignment. You approve, then watch the implementer solve a slightly different problem than the one in your head. We have started treating that gap as a UI problem, not a model problem. And the UI we have, for coding agents specifically, is bad.
Thariq Shihipar at Claude Code has been making this case publicly for a while: agents should be emitting HTML, not markdown, for most non-trivial output. His thread is the right primer on why, and we’re not going to try to re-derive it here. What we want to add is the piece that has been missing for us. We needed a way to use HTML at every plan stage without the token cost stacking up across the session. That way is a screenshot, borrowed from how DeepSeek-OCR handles context compression.
The arguments worth restating here are the ones the rest of this post leans on: Markdown won by inertia. It rendered everywhere, was easy for a human to hand-edit, and the kinds of plans agents used to produce were short. None of that still binds. Most people are no longer hand-editing agent-generated specs; they are prompting the agent to edit them. Plans have grown into full RFCs. And every modern reviewer has a browser tab open.
HTML carries information markdown cannot: tables with real column alignment, SVG diagrams drawn to scale, before/after panels rendered side by side at the same visual weight. In the absence of those, agents fall back to ASCII boxes and unicode block characters approximating colors. That fallback is what most markdown plans actually look like at length, and it is why nobody reads past line 100. Information density matters most at the plan stage. This is where the gap between what the agent thinks you want and what you actually want is widest.
Forcing the plan through a flat-text encoding is a lossy compression step you do not need to be performing. Thariq catalogs the use cases: plan stages with branching options, design and prototype reviews, PR walkthroughs, code and architecture explainers, throwaway custom editors that end with a “copy as JSON” button. We have ended up using HTML for all of those. Our experience matches his closely enough that the right move is to point you at his thread rather than re-list them.
Where this landed for us: design work with a coding agent. The plan-stage argument is the one that converted us, and design work is where it shows up most starkly. The last time we were iterating on a UI change with Claude Code, we asked for the plan as a single-file HTML artifact instead of the usual markdown. Two columns, BEFORE on the left, AFTER on the right, rendered with the real design tokens and chrome the UI actually ships.
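For concreteness, here is the shape of that artifact. This is a sketch, not the agent’s actual output; the markup, class names, and token values are all illustrative:

```python
# A sketch of the single-file artifact we ask the agent to produce.
# Markup, class names, and CSS token values here are illustrative.
from pathlib import Path

ARTIFACT = """<!doctype html>
<html><head><style>
  :root { --accent: #5b8def; --radius: 8px; }   /* the UI's real design tokens go here */
  body { font: 14px/1.5 system-ui; margin: 2rem; }
  .compare { display: grid; grid-template-columns: 1fr 1fr; gap: 2rem; }
  .panel h2 { font: 600 12px system-ui; letter-spacing: .1em; color: #888; }
</style></head>
<body>
  <div class="compare">
    <div class="panel"><h2>BEFORE</h2><!-- current component, real markup --></div>
    <div class="panel"><h2>AFTER</h2><!-- proposed component, real markup --></div>
  </div>
</body></html>"""

Path("plan-artifact.html").write_text(ARTIFACT)
```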
The point is not the specific feature. The point is that one artifact got us to high-fidelity comprehension in a single round trip. The markdown equivalent would have been a paragraph of prose and a bullet list. Readable, but lossy in exactly the ways that matter for a visual change. Getting to the same level of confidence through markdown would have taken three or four back-and-forth turns of “what does this look like next to X” and “show me the spacing,” each one re-tokenizing the conversation and giving us a worse mental picture than the rendered comparison did instantly.
The expensive operation is reading the spec and noticing what the agent got wrong. Spending model tokens on rendered HTML pays for itself the first time it replaces three turns of “what does this look like next to X” with one look.
Where Thariq’s argument gets harder: token cost on long sessions. HTML is not free. A single artifact comparing two design approaches with inline styles, SVG, and full content runs roughly four to six times the tokens of the equivalent markdown plan. Generation also takes two to four times longer. On a one-shot artifact that’s fine. On a long coding-agent session, the plan gets re-read by the implementer, then the reviewer, then the follow-up planner. The HTML keeps getting re-tokenized into context, and the cost stacks up across the session.
The fix came from a different research direction. DeepSeek-OCR is the missing mechanism. DeepSeek-AI’s paper DeepSeek-OCR: Contexts Optical Compression makes a simple claim: a page of text rendered as an image and processed by a vision encoder can be encoded into far fewer tokens than the same text processed as text. Their model card lists the encoding modes. A 1024x1024 image of a full page becomes 256 vision tokens. Their Tiny mode does it in 64. For content that has visual structure, the image channel encodes more per token than the text channel by a wide margin.
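A back-of-envelope comparison makes the asymmetry concrete. The four-to-six-times multiplier is from our sessions above; the 256-token figure is the 1024x1024 mode from their model card; the plan size and re-read count are assumptions for illustration:

```python
# Illustrative numbers: a 2,000-token markdown plan, re-read three times
# across the session (implementer, reviewer, follow-up planner). Assumes
# the rendered artifact fits one 1024x1024 screenshot; longer plans
# scale linearly with page count.
markdown_plan = 2_000
html_plan = markdown_plan * 5          # mid-point of the 4-6x multiplier
rereads = 3

html_as_text = html_plan * rereads     # 30,000 tokens back into context
html_as_image = 256 * rereads          # 768 vision tokens

print(html_as_text / html_as_image)    # ~39x cheaper on the re-read path
```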
You do not need to run their model to borrow the mechanism. Once you have an HTML artifact you are happy with, you do not need to keep the HTML itself in context for subsequent agent calls. Render it, screenshot it, feed the PNG back as an image. The vision tokens encode the same spec at a fraction of the text-token cost, and the human-readable HTML is preserved on disk for the next time you need to iterate.
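The render step needs nothing exotic. A minimal sketch using Playwright, which is our assumption here; any headless browser that can screenshot a page works:

```python
# Render the approved artifact headlessly and capture a full-page PNG.
# Requires: pip install playwright && playwright install chromium
from pathlib import Path
from playwright.sync_api import sync_playwright

def snapshot(html_path: str, png_path: str, width: int = 1024) -> None:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": width, "height": 1024})
        page.goto(Path(html_path).resolve().as_uri())
        # full_page=True keeps capturing past the viewport for long artifacts
        page.screenshot(path=png_path, full_page=True)
        browser.close()

snapshot("plan-artifact.html", "plan-artifact.png")
```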
The workflow we have settled into: Agent generates the HTML artifact as part of the plan stage. We open it in a browser, review, edit if needed, approve. A small wrapper like the sketch above renders the artifact and captures a PNG. Subsequent agent calls receive the PNG as part of the spec, not the raw HTML. The trade is asymmetric. Our review happens against the rendered HTML, where spacing, alignment, and color do the work of catching the misalignments. The model’s re-reads across the implementer and reviewer stages happen against the screenshot, at vision-token prices.
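On the wire, a subsequent call looks like this. A sketch against the Anthropic Python SDK; the file name, model name, and prompt are placeholders:

```python
# Feed the approved spec back as vision tokens instead of raw HTML.
import base64
import anthropic

client = anthropic.Anthropic()
with open("plan-artifact.png", "rb") as f:
    png_b64 = base64.standard_b64encode(f.read()).decode()

message = client.messages.create(
    model="claude-sonnet-4-5",          # placeholder; any vision-capable model
    max_tokens=4096,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64", "media_type": "image/png", "data": png_b64}},
            {"type": "text",
             "text": "This screenshot is the approved plan. Implement the AFTER state."},
        ],
    }],
)
```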