Parse, Don’t Guess

Parse, Don’t Guess (解析,而非猜测)

Three days before going open source, I deleted my parser’s smartest feature. 333 lines. 6 functions. 12 passing tests. All green, all clever, all gone. The feature worked. That was the problem. 在项目开源的前三天,我删除了我的解析器中最“聪明”的功能。333 行代码,6 个函数,12 个通过的测试用例。全部运行正常,全部设计巧妙,但全部被删除了。这个功能确实能用,而这正是问题所在。

The feature that knew too much. My PostgreSQL parser (valk-postgres-parser) extracts structure from SQL text: tables, columns, joins, filters. One of its analysis functions did something more ambitious. Give it a query like this: SELECT * FROM orders o JOIN customers c ON o.customer_id = c.id and it would tell you the foreign key relationship: orders.customer_id is a child pointing at parent customers.id. No schema. No catalog access. Just the query text. It felt like magic, and users love magic. 这个功能知道得太多了。我的 PostgreSQL 解析器 (valk-postgres-parser) 可以从 SQL 文本中提取结构:表、列、连接和过滤器。其中一个分析函数做了一些更具野心的事情。给它一个像 SELECT * FROM orders o JOIN customers c ON o.customer_id = c.id 这样的查询,它就能告诉你外键关系:orders.customer_id 是指向父表 customers.id 的子列。无需模式(schema),无需访问目录,仅凭查询文本。这感觉就像魔法,而用户热爱魔法。

Here is how the magic worked: 这就是魔法的运作方式:

func isForeignKeyColumn(column, targetTable string) bool {
    // ...
    if strings.HasSuffix(column, "_id") {
        prefix := strings.TrimSuffix(column, "_id")
        if strings.HasPrefix(targetTable, prefix) {
            return true
        }
        // Also check if table contains the prefix (handles prefixed
        // tables like fk_customers). This allows customer_id to match
        // fk_customers, e2e_customers, etc.
        // ...

Naming conventions. customer_id next to a table called customers? Must be a foreign key. Ship it. 命名约定。customer_id 旁边有一个叫 customers 的表?那一定是外键。发布吧。

The comment that aged badly. The real problem was further down, in the fallback. What happens when neither column matches a naming pattern? The function did this: 那条过时的注释。真正的问题在后面的回退逻辑中。当列名不符合任何命名模式时会发生什么?函数是这样处理的:

// If we can't determine from FK naming conventions, return a default
// relationship based on table order (left table = parent by convention
// in SQL JOINs). This is still a heuristic but is consistent with how
// JOINs are typically written.
return &JoinRelationship{
    ChildTable: rightTable,
    ParentTable: leftTable,
    // ...
}

Read that again. When the function had no idea, it did not say “I have no idea.” It returned an answer anyway, based on which table appeared first in the query. Every JOIN got an answer. That was the bug. Not a crash, not a parse error. A function that is sometimes right, sometimes wrong, and gives the caller no way to tell which. 再读一遍。当函数不知道答案时,它并没有说“我不知道”。它依然返回了一个答案,仅仅基于哪个表在查询中先出现。每一个 JOIN 操作都得到了一个答案。这就是 Bug 所在。不是崩溃,也不是解析错误。这是一个有时正确、有时错误,且让调用者无法分辨的函数。

Wrong is worse than empty. Think about what downstream code does with a foreign key relationship. In our case it generated test data: parent rows first, then children referencing them. Flip parent and child and you get foreign key violations, or worse, data that inserts fine but means the wrong thing. An empty result fails loudly. The caller sees nothing came back and handles it. A wrong result fails quietly, three layers up, a week later, in a system that trusted the library. 错误比空结果更糟糕。想想下游代码是如何处理外键关系的。在我们的案例中,它用于生成测试数据:先插入父行,再插入引用它们的子行。如果颠倒了父子关系,就会导致外键违规,或者更糟糕的情况——数据插入成功了,但含义完全错误。空结果会引发明显的失败,调用者看到没有返回结果并进行处理。而错误的结果会悄无声息地失败,在三层调用之外,一周之后,在一个信任该库的系统中爆发。

And the heuristic had plenty of ways to be wrong: 而且这种启发式方法有太多出错的可能:

  • customer_id happily matched tables named fk_customers and e2e_customers, because of a prefix check added for one test environment.
  • 自引用表 (employees.manager_id) 搞乱了方向逻辑。
  • 连接表(Junction tables,两个连接列都是主键)得到了一个随意的胜出者。
  • 表顺序的回退逻辑不过是伪装成约定的抛硬币行为。

Each bug was fixable. Another pattern, another special case, another test. That is exactly how the function grew to be the smartest 333 lines in the codebase. The line count was not the cost. The confidence was. 每个 Bug 都是可修复的。增加一个模式,增加一个特殊情况,增加一个测试。这正是该函数如何演变成代码库中最“聪明”的 333 行代码的过程。代码行数不是代价,代价是信任。

The replacement: facts or nothing. The fix was not a better heuristic. It was a contract change: 替代方案:事实或无结果。修复方法不是寻找更好的启发式算法,而是改变契约:

schema := map[string][]analysis.ColumnSchema{
    "customers": {{Name: "id", PGType: "bigint", IsPrimaryKey: true}},
    "orders": {{Name: "id", IsPrimaryKey: true}, {Name: "customer_id"}},
}

joins, _ := analysis.ExtractJoinRelationshipsWithSchema(query, schema)

You pass schema metadata, you get relationships derived from actual primary keys. You do not pass schema, you get nothing. If the metadata cannot settle a case (both sides are primary keys), the answer is nil, not a guess. 你传入模式元数据,就能得到基于实际主键推导出的关系。如果不传入模式,就什么也得不到。如果元数据无法确定情况(例如双方都是主键),答案就是 nil,而不是猜测。

Deleting the no-schema path meant deleting tested, working, green code. The 12 tests I removed were not failing. They were carefully asserting that the guesses came out the way the guesses come out. Tests that lock in behavior nobody should rely on are not coverage. They are a fence around a landmine. 删除“无模式”路径意味着删除经过测试、运行正常且绿色的代码。我移除的 12 个测试用例并没有失败。它们只是在小心翼翼地断言猜测的结果符合预期。那些锁定无人应依赖的行为的测试不是覆盖率,它们只是在给地雷围上栅栏。

The migration cost turned out to be near zero: every production caller already had schema metadata sitting right there. They had just never been asked for it. 迁移成本几乎为零:每个生产环境的调用者其实都已经拥有了模式元数据,只是之前从未被要求提供而已。

The part I did not expect. Last week a stranger opened an issue on the repo. He was analyzing hundreds of queries and hit a case where a WHERE clause column was not qualified by a table name, so the parser left the table field empty. He did not ask the parser to guess. He asked for ExtractWhereConditionsWithSchema, by name, with the same schema-map shape the JOIN function uses. The design had taught him what to ask for. That is the moment you find out an API decision was right: when users start requesting extensions to the constraint instead of exceptions from it. 我没预料到的是:上周,一位陌生人在仓库中提交了一个 Issue。他正在分析数百个查询,遇到了一个 WHERE 子句中的列没有被表名限定的情况,因此解析器将表字段留空了。他没有要求解析器去猜测,而是要求提供 ExtractWhereConditionsWithSchema 函数,并使用与 JOIN 函数相同的模式映射结构。这种设计教会了他该如何提出需求。当你发现用户开始要求扩展约束,而不是要求绕过约束时,你就知道你的 API 设计是正确的。

The rule. You probably know “Parse, don’t validate”: make illegal states unrepresentable by parsing input into types that carry proof. This is the next clause of the same contract. Parsing tells the truth about what the input is. It must also tell the truth about what it cannot know. If your library cannot know, it must say so. An empty result is an answer. A guess is a liability with good marketing. 规则:你可能知道“解析,而非验证”(Parse, don’t validate):通过将输入解析为带有证明的类型,使非法状态无法表示。这是同一契约的下一条准则。解析不仅要说明输入是什么,还必须说明它无法知道什么。如果你的库无法确定,它必须直说。空结果也是一种答案。而猜测,不过是包装了精美营销的负债。

We are currently spending billions of dollars teaching language models to say “I don’t know” instead of hallucinating a confident answer. Your API can do it for free, in the type system, today: take the inputs that make the answer knowable, and return nothing when it is not. 我们目前正花费数十亿美元教导语言模型说“我不知道”,而不是一本正经地胡说八道。你的 API 今天就可以在类型系统中免费做到这一点:接收使答案可知的输入,并在无法确定时返回空值。

The smartest code I ever deleted is the reason people trust what is left. The parser is open source at github.com/ValkDB/postgresparser, 260 stars and counting. Issues and PRs welcome, especially the ones that ask for facts instead of guesses. 我删除的最“聪明”的代码,正是人们信任剩余代码的原因。该解析器已开源:github.com/ValkDB/postgresparser,目前已有 260 颗星。欢迎提交 Issue 和 PR,特别是那些要求事实而非猜测的请求。