AI Evals, Part 2: Error Analysis — The Unglamorous Superpower Behind Good Evals

Part 2 of a series on building production AI on .NET. Part 1 covered what evals are and the Analyze → Measure → Improve lifecycle. This post is about the step everyone wants to skip: Analyze.

When a team decides to “take evals seriously,” the first thing they usually do is wrong. They open a dashboard tool, wire up a generic “correctness” score, and watch a number. It feels productive. It produces a chart. And it tells them almost nothing, because they skipped the step that decides what the chart should even measure.

That step is error analysis: reading your AI’s actual outputs and naming, precisely, the ways they go wrong. It’s unglamorous: no library, no dashboard, just you and a few dozen real examples. It is also, by a wide margin, the highest-leverage thing you will do in evals, because error analysis is where the signal comes from. Everything downstream is just operationalising what you find here.

Why you can’t skip straight to metrics

There’s a gap between you and your running system that’s easy to underestimate. Thousands of inputs flow through your AI feature daily, in shapes you never anticipated, and you have no realistic way to see them at scale. Call it the comprehension gap: the distance between the developer and a true understanding of what the data and the model are actually doing.

Metrics don’t bridge that gap; they presuppose it’s already bridged. To measure “conciseness” you must first have noticed that verbosity is a failure mode worth caring about. If you pick your metrics before you’ve read your data, you’re measuring your assumptions, not your product. The classic result: a dashboard glowing green while users quietly churn over a problem your metrics were never designed to catch.

Error analysis is how you cross the gap. You trade scale for truth: you can’t read everything, so you read a sample, carefully.
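Reading a sample starts with drawing one that is not hand-picked. A minimal sketch in Python, assuming traces are available as a list of dicts; the field names (`id`, `input`, `output`) and the `draw_sample` helper are hypothetical, not an actual production schema:

```python
import random

def draw_sample(traces, k=100, seed=0):
    """Return up to k traces chosen uniformly at random, reproducibly."""
    rng = random.Random(seed)  # fixed seed so a second labeller can read the same sample
    if len(traces) <= k:
        return list(traces)
    return rng.sample(traces, k)

# Hypothetical traces standing in for a day's production logs.
traces = [{"id": i, "input": f"input {i}", "output": f"output {i}"} for i in range(5000)]
sample = draw_sample(traces, k=100, seed=42)
```

Uniform sampling keeps the weird inputs in roughly the proportion users actually send them, which a curated demo set does not.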

How error analysis actually works

It’s a three-move loop, and the moves are deliberately low-tech.

  1. Get a starting dataset and read it. Pull a sample of real (or realistic) outputs; 50 to 100 is plenty to start. Not the happy-path demo cases: the real distribution, including the weird inputs. Then actually read them. Slowly.

  2. Open-code the failures. For each output that’s wrong, write a short, free-text note describing what specifically is wrong, in your own words, with no fixed categories yet. “Explained the word using a dictionary definition instead of the meaning it has in this sentence.” “Translation is correct but the tone is far too formal for a casual chat.” “The quiz distractor is so obviously wrong it gives the answer away.” This is open coding: you’re labelling reality, not forcing it into boxes.

  3. Cluster the notes into a taxonomy. Once you have 40–50 notes, patterns emerge. Group them. Those groups are your failure taxonomy: a ranked list of how your feature fails, with rough frequencies. Now you know what to fix first (the common, severe modes) and, crucially, what your metrics should measure.

That’s the whole secret. The taxonomy is the output, and it’s worth more than any single score, because every later step (the rubric, the golden set, the judge) is downstream of it.
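The clustering move is mostly counting once the notes have been grouped by hand. A minimal sketch in Python, assuming each open-coded note has already been assigned a cluster name; the notes, cluster names, sample size, and the `build_taxonomy` helper are illustrative, not real data:

```python
from collections import Counter

def build_taxonomy(coded_notes, total_read):
    """Rank failure clusters by how often they occur in the sample that was read."""
    counts = Counter(cluster for _note, cluster in coded_notes)
    return [
        {"cluster": cluster, "count": n, "share_of_sample": n / total_read}
        for cluster, n in counts.most_common()  # most frequent cluster first
    ]

# (free-text note, hand-assigned cluster) pairs; all illustrative.
coded_notes = [
    ("Dictionary definition instead of the meaning in this sentence", "ignored sentence context"),
    ("Defined the noun sense; the sentence uses the verb", "ignored sentence context"),
    ("Correct translation, far too formal for a casual chat", "wrong register"),
    ("Distractor is so obviously wrong it gives the answer away", "implausible distractor"),
    ("Generic gloss, no reference to the passage", "ignored sentence context"),
]

taxonomy = build_taxonomy(coded_notes, total_read=50)
for row in taxonomy:
    print(f'{row["cluster"]}: {row["count"]} ({row["share_of_sample"]:.0%} of sample)')
```

Frequency alone only ranks how common a mode is. Severity still has to be judged by a person before deciding what to fix first.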

A mindset note: be a detective, not a judge (yet)

The hard part of error analysis isn’t mechanical, it’s psychological. You will be tempted to immediately assign a 1–5 score, or to jump to “the fix is to add a line to the prompt.” Resist both. Scoring too early collapses rich information into a number (“it’s a 2”) that hides why. Fixing too early means you patch the first failure you see instead of the most common one. Stay descriptive for as long as you can. Your only job in this phase is to understand and categorise. Judgement and repair come later.

A second trap is doing it alone. When two people label the same outputs, they disagree, and the disagreements are gold, because they reveal that “good” isn’t actually defined yet. A short alignment session to resolve them sharpens your definition of quality before you bake it into a rubric. (Solo founders can approximate this by labelling, sleeping on it, and re-labelling cold.)
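One way to prepare for that alignment session is to list exactly where two labellers disagreed and to put a rough number on the agreement. The sketch below is in Python and uses Cohen's kappa, which is a common choice but not something this post prescribes; the labels and both helper functions are hypothetical, and both dicts are assumed to cover the same item ids:

```python
from collections import Counter

def disagreements(labels_a, labels_b):
    """Return (item_id, label_a, label_b) for every item the two labellers coded differently."""
    return [(i, labels_a[i], labels_b[i]) for i in sorted(labels_a) if labels_a[i] != labels_b[i]]

def cohens_kappa(labels_a, labels_b):
    """Agreement between two labellers, corrected for the agreement expected by chance."""
    items = sorted(labels_a)
    n = len(items)
    observed = sum(labels_a[i] == labels_b[i] for i in items) / n
    freq_a = Counter(labels_a[i] for i in items)
    freq_b = Counter(labels_b[i] for i in items)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both labellers used one identical label for every item
    return (observed - expected) / (1 - expected)

# Illustrative labels for six outputs, keyed by item id.
alice = {1: "ok", 2: "ignored context", 3: "ok", 4: "wrong register", 5: "ignored context", 6: "ok"}
bob = {1: "ok", 2: "ignored context", 3: "ignored context", 4: "wrong register", 5: "ok", 6: "ok"}

to_discuss = disagreements(alice, bob)  # items 3 and 5
kappa = cohens_kappa(alice, bob)
```

The list of disagreements is the useful artefact here, since it is the agenda for the session. The kappa value is only a way to see whether agreement improves after the definitions are tightened.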

How error analysis shaped TextStack’s evals

This isn’t abstract for us. TextStack has seven AI surfaces, and every rubric we score against came directly out of reading failures, not out of a generic template.

Take Explain (tap a word, get a short in-context explanation). Reading real outputs surfaced a recurring failure: the model would produce a competent dictionary definition while ignoring the sentence the reader was actually looking at, which is useless for someone trying to understand this passage. That single observation is why the Explain rubric scores accuracy in context and usefulness to a learner as distinct axes, and explicitly penalises dictionary boilerplate under conciseness. The rubric is a direct transcription of the taxonomy.

Other surfaces produced different taxonomies, and therefore different axes:

  • Translate kept failing on register (accurate but wrong formality), so register became its own scored dimension alongside accuracy and fluency.
  • Vocabulary distractors (wrong answers in a quiz) failed by being implausible (too obviously wrong) or too similar to the right answer, so the rubric scores plausibility, distinctness, and difficulty.

We didn’t invent those dimensions in a meeting. We read outputs until the dimensions were obvious. And because every AI call is traced and viewable on an internal /ai-quality page, error analysis isn’t a one-time exercise: new production failures keep feeding new categories back into the taxonomy.
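To make the link between taxonomy and rubric concrete, the axes named above can be written down as data. This is a sketch in Python, not TextStack's actual code: the axis names follow the text, but the identifier spellings, the `validate_scores` helper, and the 1–5 score range are assumptions made for illustration:

```python
# Axes per surface, taken from the failure modes described in the text.
RUBRICS = {
    "explain": ["accuracy_in_context", "usefulness_to_learner", "conciseness"],
    "translate": ["accuracy", "fluency", "register"],
    "vocabulary_distractors": ["plausibility", "distinctness", "difficulty"],
}

def validate_scores(surface, scores, low=1, high=5):
    """Check that a scored output covers exactly the rubric's axes, within range.

    The 1-5 range is an assumed default, not something the rubric is known to use.
    """
    axes = RUBRICS[surface]
    missing = [a for a in axes if a not in scores]
    extra = [a for a in scores if a not in axes]
    out_of_range = [a for a in axes if a in scores and not low <= scores[a] <= high]
    return {"missing": missing, "extra": extra, "out_of_range": out_of_range}

report = validate_scores("translate", {"accuracy": 5, "fluency": 4})
# report["missing"] lists "register": the axis that error analysis added is unscored.
```

Keeping the axes in one place makes it cheap to add a dimension when a new failure category shows up in production.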

The pitfalls

  • Scoring before describing. A number erases the why. Open-code in words first.
  • Vague categories. “Bad output” isn’t a category; “ignored the sentence context” is. Specific enough to act on.
  • Too small a sample, or only the easy cases. If you only read successes, you’ll conclude everything is fine.
  • Fixing during analysis. Note the failure, move on. Triage after you can see the whole picture.
  • Labelling solo.