Parse, Don't Validate — In a Language That Doesn't Want You To

Update: If you liked this post, the follow-up, Effect Without Effect-TS: Algebraic Thinking in Plain TypeScript, picks up where we left off and takes the ideas further.

I’ve been thinking about Alexis King’s Parse, don’t validate again. I do this quite regularly, actually, usually after staring at a TypeScript codebase that’s been quietly accumulating if (user.email) checks like barnacles. The post is from 2019, and the advice (or rather principle) is way older than that. And yet most TypeScript I read (including, embarrassingly, plenty I’ve written) still validates instead of parsing.

The pitch, if you haven’t read it (you should): a validator says “this thing is fine, please continue.” A parser says “give me a blob, and I’ll either give you back a more precise type or tell you why I can’t.” The difference sounds academic until you realize that validators throw away information the moment they finish running, while parsers preserve what they learned by encoding it in the type. Once you’ve parsed a string into an EmailAddress, the rest of your program never has to wonder again. Peace of mind, and more mental capacity for the fun stuff.
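To make the contrast concrete, here is a minimal sketch of the non-empty list example King uses in her post, transplanted to TypeScript. The names NonEmptyArray, isNonEmpty, parseNonEmpty and head are made up for this illustration and do not come from any library.

```typescript
// A tuple type with at least one element: the "more precise type".
type NonEmptyArray<T> = [T, ...T[]];

// Validator: answers yes or no, and the caller's array is still typed T[].
function isNonEmpty<T>(xs: T[]): boolean {
  return xs.length > 0;
}

// Parser: hands back the refined type, or null when it cannot.
function parseNonEmpty<T>(xs: T[]): NonEmptyArray<T> | null {
  if (xs.length === 0) return null;
  const [first, ...rest] = xs;
  return [first as T, ...rest];
}

// Only accepts arrays already proven non-empty, so it needs no emptiness check.
function head<T>(xs: NonEmptyArray<T>): T {
  return xs[0];
}

const parsedTags = parseNonEmpty(["ts", "types"]);
const firstTag = parsedTags === null ? "(none)" : head(parsedTags);
```

After isNonEmpty returns true, the compiler knows nothing new about the array. After parseNonEmpty returns a non-null value, the knowledge travels with the value, and head can demand it in its signature.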

In Haskell or Elm or F# this is just how you write code. The language pulls you toward it. In TypeScript… it doesn’t. TypeScript will happily let you do the right thing, but it won’t insist, and it won’t even gently nudge. If anything, structural typing actively undermines the whole game.

Let me show you what I mean.

The validator we’ve all written

Here’s the kind of code I see (and write) constantly:

interface User { id: number; email: string; age: number; }

// The actual validation is naïve and simplistic, but you get the point:
function isValidUser(user: User): boolean {
  if (!user.email.includes("@")) return false;
  if (user.age < 0 || user.age > 150) return false;
  return true;
}

function sendWelcome(user: User) {
  if (!isValidUser(user)) {
    throw new Error("invalid user");
  }
  // ...later, deeper in the call stack:
  emailService.send(user.email, `Welcome, age ${user.age}`);
}

Spot the lie? User.email is just string. User.age is just number. The validation happened (congrats), but the type system forgot about it the instant isValidUser returned. Three function calls deeper, when somebody touches user.email, there is nothing stopping them from passing it to a function that expects a real email. Because as far as TypeScript is concerned, it’s just a string. Same as "", same as "hello", same as "definitely not an email".

So what do we do? We re-validate. We add another if. We write a unit test. We hope. (King has a much better word for this in the original post: “shotgun parsing”, meaning validation scattered everywhere, none of it remembered.)

What we actually want

We want this:

function sendWelcome(user: ValidUser) {
  emailService.send(user.email, `Welcome, age ${user.age}`);
}

And we want it to be impossible to call sendWelcome with anything that hasn’t been through the parser. No re-checking or “defensive programming”. The type itself serves as the proof, as it were.

In Elm I’d reach for an opaque type and a smart constructor and be done in about four lines. In TypeScript it’s, well, possible at least. Just less pleasant.

Branded types, or: lying to the structural type system on purpose

TypeScript is structurally typed, which means two types with the same shape are interchangeable. string is string is string. There’s no newtype. A declaration like type EmailAddress = string is only an alias; it does not produce a genuinely distinct type the way, say, a Haskell newtype does.

The workaround the community has settled on is branding (also called tagging, also called nominal typing via intersection). The cheap version is a string-literal phantom ({ readonly __brand: "Email" }) and you’ll see it everywhere; the slightly less cheap version uses a unique symbol that you don’t export from the module, so nobody outside can even spell the brand to forge it:

declare const EmailBrand: unique symbol;
declare const AgeBrand: unique symbol;

type Email = string & { readonly [EmailBrand]: true };
type Age = number & { readonly [AgeBrand]: true };

There is no brand field at runtime. It’s a “phantom”: a type-level marker that stops a plain string from being used as an Email at compile time. The only way to get an Email without a cast is through a function that knows how, because nothing outside this module can even name the symbol to fake one. (Since version 4.1, TypeScript also lets you flirt with template literal types, such as type Email = `${string}@${string}`, which is fun for a demo and not enough on its own.) This is the move that lets you make illegal states unrepresentable without leaving the language.
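As a rough illustration of why the template literal type is not enough on its own, consider the sketch below. The names EmailShape and looksLikeEmail are invented for this example. The type only asserts that an @ appears somewhere, and it accepts values no mail server would.

```typescript
// Hypothetical name; matches any string with an "@" somewhere in it.
type EmailShape = `${string}@${string}`;

const plausible: EmailShape = "ada@example.com";
// Also type-checks, because either ${string} part may be empty.
const nonsense: EmailShape = "@";

// A value typed as plain string still needs a runtime check to fit the type.
function looksLikeEmail(raw: string): raw is EmailShape {
  return raw.includes("@");
}

const fromForm: string = "not an email";
// @ts-expect-error a plain string is not assignable to the template literal type
const rejected: EmailShape = fromForm;
```

So the template literal type encodes one syntactic fact and nothing about trimming, length, or the domain part, which is why the brand plus a parser is the sturdier combination.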

The brand is one-way, by the way: an Email is still assignable to string. Nominal into the domain, structural on the way out, which is pretty much exactly what you want.
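The one-way behaviour can be sketched as follows. Here unsafeEmail is a hypothetical stand-in for a real parser (it performs no check at all), and shout and domainOf are made-up consumers.

```typescript
declare const EmailBrand: unique symbol;
type Email = string & { readonly [EmailBrand]: true };

// Hypothetical stand-in for the trusted boundary; a real parser would check first.
function unsafeEmail(raw: string): Email {
  return raw as Email;
}

// Accepts any string.
function shout(text: string): string {
  return text.toUpperCase();
}

// Accepts only branded values.
function domainOf(email: Email): string {
  return email.slice(email.indexOf("@") + 1);
}

const address = unsafeEmail("ada@example.com");

// Out of the domain: an Email is accepted wherever a string is.
const loud = shout(address);

// Into the domain: a plain string is rejected at compile time.
// @ts-expect-error Argument of type 'string' is not assignable to parameter of type 'Email'
const sneaky = domainOf("bob@example.com");
```

At runtime address is an ordinary string, so all the usual string methods keep working; the brand exists only for the type checker.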

That function is your parser:

type ParseError = { kind: "ParseError"; message: string };
type Parsed<T> = { kind: "ok"; value: T } | { kind: "err"; error: ParseError };

function parseEmail(raw: string): Parsed<Email> {
  if (!raw.includes("@")) {
    return { kind: "err", error: { kind: "ParseError", message: "missing @" } };
  }
  // we've checked, now we lie to the type system on purpose
  return { kind: "ok", value: raw as Email };
}

function parseAge(raw: unknown): Parsed<Age> {
  if (
    typeof raw !== "number" ||
    !Number.isInteger(raw) ||
    raw < 0 ||
    raw > 150
  ) {
    return { kind: "err", error: { kind: "ParseError", message: "bad age" } };
  }
  return { kind: "ok", value: raw as Age };
}

(The parseEmail predicate is embarrassingly thin; a real one would trim, lowercase, and at least pretend to validate the domain part. I’m not, however, writing an email parser in a blog post(!).) The as Email hurts a little, and it should. It’s the one place where we’re allowed to break the rules: the parser is the trusted boundary. Everywhere else in the codebase, you cannot conjure an Email out of a string. You have to call parseEmail and handle both branches. (I’m using kind: "ok" | "err" instead of a bool.)
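Putting the two parsers together, a record-level parser might look something like the sketch below. ValidUser, parseUser, fail and the outbox array (a stand-in for emailService) are assumptions made for illustration; parseEmail and parseAge are repeated in condensed form so the block runs on its own.

```typescript
declare const EmailBrand: unique symbol;
declare const AgeBrand: unique symbol;
type Email = string & { readonly [EmailBrand]: true };
type Age = number & { readonly [AgeBrand]: true };

type ParseError = { kind: "ParseError"; message: string };
type Parsed<T> = { kind: "ok"; value: T } | { kind: "err"; error: ParseError };

// Hypothetical helper to keep the error branches short.
function fail(message: string): { kind: "err"; error: ParseError } {
  return { kind: "err", error: { kind: "ParseError", message } };
}

function parseEmail(raw: string): Parsed<Email> {
  if (!raw.includes("@")) return fail("missing @");
  return { kind: "ok", value: raw as Email };
}

function parseAge(raw: unknown): Parsed<Age> {
  if (typeof raw !== "number" || !Number.isInteger(raw) || raw < 0 || raw > 150) {
    return fail("bad age");
  }
  return { kind: "ok", value: raw as Age };
}

// Assumed shape: the same fields as User, but with branded types.
type ValidUser = { id: number; email: Email; age: Age };

// Parses untrusted input once, at the boundary, and stops at the first failure.
function parseUser(input: unknown): Parsed<ValidUser> {
  if (typeof input !== "object" || input === null) return fail("expected an object");
  const { id, email: rawEmail, age: rawAge } = input as Record<string, unknown>;
  if (typeof id !== "number") return fail("bad id");
  if (typeof rawEmail !== "string") return fail("email must be a string");
  const email = parseEmail(rawEmail);
  if (email.kind === "err") return email;
  const age = parseAge(rawAge);
  if (age.kind === "err") return age;
  return { kind: "ok", value: { id, email: email.value, age: age.value } };
}

// Stand-in for emailService.send, so the example has no external dependency.
const outbox: string[] = [];

// No re-checking here: the parameter type is the proof.
function sendWelcome(user: ValidUser): void {
  outbox.push(`${user.email}: Welcome, age ${user.age}`);
}

const result = parseUser(JSON.parse('{"id": 1, "email": "ada@example.com", "age": 36}'));
if (result.kind === "ok") {
  sendWelcome(result.value);
}
```

This version reports only the first failure. Collecting every error is a reasonable alternative, at the cost of a slightly richer Parsed type.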