LLM Evals Are Based on Vibes — I Built the Missing Layer That Decides What Ships

How I built a lightweight evaluation system that measures faithfulness, detects hallucinations, and turns subjective LLM outputs into reproducible metrics, all in pure Python.

TL;DR: This article shows a full working implementation in pure Python, with real benchmark numbers. Most teams evaluate LLM responses by reading them and guessing. That breaks the moment you scale. The real problem is not that models hallucinate. It is that nothing catches the confident hallucinations: the responses that score 0.525, pass your threshold, and are quietly wrong.

I built a scoring layer that splits faithfulness into two signals: attribution and specificity. High specificity plus low attribution is the signature of a hallucination. A single score misses it every time. This is not an evaluation script. It is a decision engine that sits between your model and your user.

I Changed One Line in My Prompt. Everything Broke.

Four words broke my eval system: “be specific and detailed.” I added them to my system prompt on a Tuesday afternoon. Routine change. The kind you make a dozen times when you’re tuning a RAG pipeline. I ran my next test batch an hour later and question three came back like this: “Context engineering was invented at MIT in 1987 and is primarily used for hardware cache optimization in CPUs. It has nothing to do with language models.”

My scorer gave it 0.525. Above my passing threshold of 0.5. Green light. I almost missed it. I was skimming outputs the way you do when you’ve been staring at test results for two hours, checking scores, not reading sentences. The only reason I caught it was that “1987” looked wrong to me. I read it twice and pulled up the context doc. The model had invented every specific detail in that sentence.

The score had gone up because the response got more specific. The quality had collapsed because the model got more confident about things it was fabricating. My eval layer had one number to cover both directions, and it couldn’t tell them apart. I caught it manually that time. That is not a process. That is luck. And the whole point of an eval system is that it should not depend on whether you happen to be reading carefully on a given afternoon.

But the moment you try to actually fix it, things get complicated. How do you even define “good”? If you ask another LLM to judge the first one, you’re just moving the problem up a level. The real danger isn’t a broken response; it’s the one that sounds like an expert but is quietly lying to you.

Most tutorials tell you to just call the model and see if the output “looks right.” But look at the numbers. What happens when your response scores 0.525 overall, technically acceptable, but its grounding score (the attribution signal) is 0.428 and its specificity is 0.701? That combination means confident but ungrounded. That is not a borderline response. That is a hallucination wearing a business suit.
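To make that combination concrete, here is a minimal sketch of checking the two signals separately instead of trusting the overall number. The 0.5 pass threshold and the three scores are the ones quoted above; the grounding and specificity cutoffs, and every name in the block, are assumptions for illustration and not the values used in the actual system.

```python
from dataclasses import dataclass

PASS_THRESHOLD = 0.5      # overall pass mark quoted in the article
MIN_GROUNDING = 0.5       # assumed cutoff: below this, claims are weakly supported
HIGH_SPECIFICITY = 0.65   # assumed cutoff: above this, the response makes concrete claims


@dataclass(frozen=True)
class Scores:
    """Hypothetical container for the three numbers discussed above."""
    overall: float
    grounding: float
    specificity: float


def passes_single_threshold(s: Scores) -> bool:
    # The naive check: one number against one threshold.
    return s.overall >= PASS_THRESHOLD


def is_confident_but_ungrounded(s: Scores) -> bool:
    # The two-signal check: specific claims that the context does not support.
    return s.specificity >= HIGH_SPECIFICITY and s.grounding < MIN_GROUNDING


example = Scores(overall=0.525, grounding=0.428, specificity=0.701)
print(passes_single_threshold(example))      # True
print(is_confident_but_ungrounded(example))  # True
```

The same response gets a green light from the single threshold and a red flag from the two-signal check, which is the gap the rest of this article is about.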

These are not rare edge cases. This is what happens by default in production LLM systems, and you will not catch it with a vibe check. The answer is a missing layer most teams skip entirely. Between LLM output and user delivery, there is a deliberate step: deciding whether the response should be served, retried, or regenerated. I built that layer.
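As a rough sketch of what such a gate could look like, the function below maps three scores to one of the three outcomes named above. The rules for choosing between retry and regenerate, the cutoff values other than the 0.5 pass threshold, and all names here are assumptions made for illustration; the real layer may decide differently.

```python
from enum import Enum


class Decision(Enum):
    SERVE = "serve"
    RETRY = "retry"
    REGENERATE = "regenerate"


def decide(
    overall: float,
    grounding: float,
    specificity: float,
    pass_threshold: float = 0.5,     # pass mark quoted in the article
    min_grounding: float = 0.5,      # assumed cutoff
    high_specificity: float = 0.65,  # assumed cutoff
) -> Decision:
    """Hypothetical gate between model output and user delivery."""
    # Checked first, because the overall score can still pass in this case:
    # specific claims with weak support look like a confident hallucination.
    if specificity >= high_specificity and grounding < min_grounding:
        return Decision.REGENERATE
    # Weak overall or weakly grounded, but not confidently wrong: try again.
    if overall < pass_threshold or grounding < min_grounding:
        return Decision.RETRY
    return Decision.SERVE


print(decide(0.525, 0.428, 0.701))  # Decision.REGENERATE
```

Ordering matters in this sketch: the confident-but-ungrounded check runs before the overall threshold, so a passing overall score cannot mask it.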