Build software that heals itself in the agentic era

Disclosure: I build MailKite, and the open-source mail-parse library I use as the example is ours. But the pattern is the point: it isn’t MailKite-specific, and you can apply it to anything that eats messy input.

Self-healing software is a system architected so that, when it hits the malformed input the real world throws at it, it neither crashes nor stays broken. It records a structured, PII-free failure signature, and that signature feeds a repair loop (increasingly, an AI agent) that turns the breakage into a permanent fix behind automated gates.

In the agentic era the bottleneck is no longer writing the fix; a capable agent can do that. The bottleneck is architecting your software so an agent’s fix is safe, automatic, and cumulative. This post is that pattern. I’ll use our open-source MIME parser (mail-parse) as the running example — messy input is where software goes to die — but the shape applies to almost any system that eats hostile real-world data.

Two honesty notes before I start, because a post that blurs shipped and planned isn’t worth reading. First: this is part one of a two-part series — part one is the architecture and what runs today; part two comes after the fully autonomous loop ships and we’ve watched it heal real input in the wild. Second: I’ll label each piece shipped or in progress as I go, and there’s a status table at the end.

The loop the agentic era changes

The classic repair loop is slow and human-shaped: a bug slips into production → someone eventually files an issue → a human reproduces it, writes a patch, ships a release → weeks later every install benefits. It works, but it’s measured in weeks and gated on a human being in the loop for every single fix. Agents change what’s possible here, not by being trusted to write perfect code, but by being fast and tireless at the boring middle.

The interesting question stops being “can an agent write the fix?” (increasingly, yes) and becomes: when an agent can propose a fix in seconds, how do you build software so that letting it do so isn’t reckless? Answer that, and your system stops accumulating breakage — every new way the world is wrong becomes a one-time event. Five design moves make it work. I’ll state each generally, then ground it in the parser.

1. Never crash — turn every failure into a structured signal

The foundation of a self-healing system is that failure is a first-class, structured output, not an exception that unwinds the stack. If your software dies on bad input, there’s nothing to heal; if it silently mangles it, there’s nothing to detect. The discipline is: always produce the best result you can, and alongside it a machine-readable record of everything you had to paper over.

In the parser (shipped): mail-parse never throws. An unclosed MIME boundary pops the orphaned context and emits BOUNDARY_NOT_CLOSED; a charset that won’t decode falls back and emits UNKNOWN_CHARSET. You always get a message and a typed list of what was wrong with it. Those diagnostics aren’t logging — they’re the raw material every downstream loop runs on.

import { parse } from "@mailkite/mail-parse"; 
// parse() never throws — even on a broken message it returns a best-effort 
// result *plus* a typed list of everything it had to paper over. 
const msg = parse(rawMime); 
msg.subject; // decoded as far as it could 
msg.attachments; // whatever it could recover 
msg.diagnostics; // → [ 
// { code: "BOUNDARY_NOT_CLOSED", scope: "structure" }, 
// { code: "UNKNOWN_CHARSET", scope: "part", contentType: "text/html" }, 
// ]
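
To make the "never throw, always report" discipline concrete, here is a minimal sketch of a tolerant multipart splitter. It is not mail-parse's actual implementation: `splitMultipart`, `SplitResult`, and the line-by-line scanning strategy are assumptions made for illustration. Only the `BOUNDARY_NOT_CLOSED` code and the `scope` field come from the example above.

```typescript
// Hypothetical types for this sketch; only the code/scope names mirror the post.
interface Diagnostic {
  code: string;
  scope: "envelope" | "structure" | "part";
}

interface SplitResult {
  parts: string[];
  diagnostics: Diagnostic[];
}

// Splits a multipart body on its boundary. It never throws: if the closing
// delimiter ("--boundary--") is missing, it keeps whatever it recovered and
// records a diagnostic instead.
function splitMultipart(body: string, boundary: string): SplitResult {
  const parts: string[] = [];
  const diagnostics: Diagnostic[] = [];
  const open = "--" + boundary;
  const close = open + "--";
  const lines = body.split(/\r?\n/);

  let current: string[] | null = null; // lines of the part being collected
  let closed = false;

  for (let i = 0; i < lines.length; i++) {
    const line = lines[i];
    if (line === close) {
      if (current !== null) parts.push(current.join("\n"));
      current = null;
      closed = true;
      break;
    }
    if (line === open) {
      if (current !== null) parts.push(current.join("\n"));
      current = [];
    } else if (current !== null) {
      current.push(line);
    }
  }

  if (!closed) {
    // Best effort: keep the orphaned part, and say what was papered over.
    if (current !== null) parts.push(current.join("\n"));
    diagnostics.push({ code: "BOUNDARY_NOT_CLOSED", scope: "structure" });
  }
  return { parts, diagnostics };
}
```

The caller gets usable parts in both the clean and the broken case; the only difference is whether `diagnostics` is empty.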

2. Make fixes additive, not surgery — a plugin seam

If every fix means editing the core, fixes are risky, they collide, and no agent (or human) should be trusted to make them at speed. The move is a registry: a seam where new behavior is a self-contained, narrowly scoped unit. It can’t take down the whole system, and it’s obvious what it touches.

In the parser (shipped): fixups are middleware in a PostCSS-style registry — each declares a phase, a match predicate, and a handler, and a middleware that throws becomes a contained MIDDLEWARE_ERROR diagnostic while the chain keeps going. A new format quirk is a new middleware with a narrow predicate, not a patch threaded through the core. That containment is exactly what later lets a generated fix be admitted without betting the system on it.

// A new format quirk is a self-contained middleware with a narrow predicate — 
// not a patch threaded through the core. 
const tnef = { 
  phase: "decode", 
  match: (part) => part.contentType === "application/ms-tnef", 
  handler: (part) => extractWinmailDat(part), 
}; 
registry.use(tnef); 
// If handler throws, the parser records a contained MIDDLEWARE_ERROR 
// diagnostic and the rest of the chain keeps running.
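
The containment behavior described in the comment above could look roughly like the following. This is a sketch under stated assumptions, not the library's registry: the `Registry` class, its `run` method, the `name` field, the `"normalize"` phase, and the `Part` shape are all hypothetical. Only the phase/match/handler triple and the `MIDDLEWARE_ERROR` code come from the post.

```typescript
type Phase = "decode" | "normalize"; // "normalize" is a made-up second phase

interface Part {
  contentType: string;
  body: string;
}

interface Diagnostic {
  code: string;
  detail?: string;
}

interface Middleware {
  name: string; // assumed field, used only to label diagnostics
  phase: Phase;
  match: (part: Part) => boolean;
  handler: (part: Part) => Part;
}

class Registry {
  private middlewares: Middleware[] = [];

  use(mw: Middleware): void {
    this.middlewares.push(mw);
  }

  // Runs every middleware registered for the phase, in registration order.
  // A throwing predicate or handler is recorded and skipped; the chain goes on.
  run(phase: Phase, part: Part): { part: Part; diagnostics: Diagnostic[] } {
    const diagnostics: Diagnostic[] = [];
    let current = part;
    for (let i = 0; i < this.middlewares.length; i++) {
      const mw = this.middlewares[i];
      if (mw.phase !== phase) continue;
      try {
        if (mw.match(current)) current = mw.handler(current);
      } catch (err) {
        // Record only the middleware name, never the error text, which
        // might echo message content.
        diagnostics.push({ code: "MIDDLEWARE_ERROR", detail: mw.name });
      }
    }
    return { part: current, diagnostics };
  }
}
```

Because a failing unit leaves `current` untouched, the worst a bad middleware can do in this sketch is contribute nothing and leave a diagnostic behind.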

3. Name failures identically everywhere — without leaking data

To fix a class of breakage you first have to name it, the same way across every install, without ever collecting private data. That’s a failure signature: a deterministic hash over structure only. It does two things at once — it lets a thousand installs hitting the same bug collapse into one prioritized signal, and it gives the repair loop a precise, shareable target.
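
As a rough illustration of how many installs collapse into one prioritized signal, the sketch below counts distinct installs per signature hash and ranks the hashes. Everything here is an assumption for illustration: the `Report` shape, the opaque `installId`, and ranking by distinct-install count are not described in the post.

```typescript
// A report carries only the signature hash and an opaque install identifier.
interface Report {
  hash: string;
  installId: string;
}

interface RankedSignature {
  hash: string;
  installs: number;
}

// Groups reports by hash, counts distinct installs per hash, and returns the
// hashes ordered by how many installs are affected (ties broken by hash).
function prioritize(reports: Report[]): RankedSignature[] {
  const seen: { [hash: string]: { [installId: string]: true } } = {};
  for (let i = 0; i < reports.length; i++) {
    const r = reports[i];
    if (!seen[r.hash]) seen[r.hash] = {};
    seen[r.hash][r.installId] = true; // repeat reports from one install count once
  }
  return Object.keys(seen)
    .map((hash) => ({ hash: hash, installs: Object.keys(seen[hash]).length }))
    .sort((a, b) => b.installs - a.installs || (a.hash < b.hash ? -1 : 1));
}
```

The repair loop can then start from the top of the list, so the breakage that affects the most installs is addressed first.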

In the parser (shipped): the signature is an FNV-1a hash over PII-free features — diagnostic codes, content-type, transfer-encoding, a byte-shape fingerprint, mailer family, structure path — and never bytes, addresses, or subjects. Two installs on opposite sides of the world hitting the same Outlook-TNEF quirk compute the same hash. A multi-granularity rollup lets you cluster loosely or tightly. (It’s pinned identical across our TypeScript, Python, and Go ports by a golden-corpus test, so the herd can’t drift.)

interface FailureSignature { 
  hash: string; // = fnv1a(canonicalize(features)) 
  features: { 
    scope: "envelope" | "structure" | "part"; 
    diagnosticCodes: string[]; // e.g. ["UNKNOWN_CHARSET"] 
    contentType?: string; // the offending leaf's declared type 
    transferEncoding?: string; 
    byteSignature?: string; // ...
  }
}
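
The `fnv1a(canonicalize(features))` comment can be made concrete with a small sketch. The 32-bit FNV-1a constants are the standard ones; everything else is an assumption for illustration, not mail-parse's actual canonical form: the reduced `SignatureFeatures` shape, the sorted `key=value` layout, the lowercasing, and the restriction to ASCII feature values.

```typescript
// A reduced, hypothetical subset of the features listed above.
interface SignatureFeatures {
  scope: "envelope" | "structure" | "part";
  diagnosticCodes: string[];
  contentType?: string;
  transferEncoding?: string;
}

// Produces one stable string per feature set: fixed key order, sorted codes,
// case-folded values, and empty strings for missing fields.
function canonicalize(f: SignatureFeatures): string {
  const fields: { [key: string]: string } = {
    scope: f.scope,
    diagnosticCodes: f.diagnosticCodes.slice().sort().join(","),
    contentType: (f.contentType || "").toLowerCase(),
    transferEncoding: (f.transferEncoding || "").toLowerCase(),
  };
  return Object.keys(fields)
    .sort()
    .map((k) => k + "=" + fields[k])
    .join("\n");
}

// 32-bit FNV-1a, returned as 8 hex digits. Assumes ASCII input (each code
// unit is masked to one byte); a real port would hash UTF-8 bytes instead.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i) & 0xff;
    // Multiply by the FNV prime 16777619 (2^24 + 2^8 + 0x93) with shifts,
    // then wrap to an unsigned 32-bit value.
    hash =
      (hash + (hash << 1) + (hash << 4) + (hash << 7) + (hash << 8) + (hash << 24)) >>> 0;
  }
  return ("0000000" + hash.toString(16)).slice(-8);
}

function signatureHash(f: SignatureFeatures): string {
  return fnv1a(canonicalize(f));
}
```

Because canonicalization sorts the codes and folds case before hashing, two installs that observe the same quirk in a different order, or with a differently cased content type, still arrive at the same hash. No message bytes, addresses, or subjects ever enter the input string.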