Water Cooler Small Talk, Ep. 11: Overfitting in RAG evaluation

Why memorizing for the exam doesn’t mean you understand the subject

Maria Mouschoutzi | Jun 26, 2026 | 10 min read

Water Cooler Small Talk is a special kind of small talk, typically observed in office spaces around a water cooler. There, employees frequently share all kinds of corporate gossip, myths, legends, inaccurate scientific opinions, indiscreet personal anecdotes, or outright lies. Anything goes. In my Water Cooler Small Talk posts, I discuss strange and usually scientifically invalid opinions that I, my friends, or some acquaintance of mine have overheard at the office and that have literally left us speechless.

So, here’s the water cooler opinion of today’s post: “We’ve built a RAG app that is working really well. We are now in the evaluation stage, and it’s going great because through all the testing we keep identifying issues and fixing them. We’re already at a 97% score.”

Now, I want you to pause for a second and think about what might be wrong with this statement. 🤔 Because on the surface, it sounds perfectly reasonable. Finding issues and fixing them sounds like exactly what a good evaluation process should do, doesn’t it? Responsible, even. So what is really happening?

The problem here is subtle but fundamental. If you use your evaluation process to identify issues, fix those issues, and then re-evaluate on the same set of tests, you are unfortunately not really evaluating anymore. The evaluation set has one key property that makes it so useful: the system has never seen it before. Each time you tune the system based on its results and then re-evaluate on the same set, you strip away a little more of that property. In other words, the evaluation set has quietly become part of the development process and is now more of a training set.
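One way to protect that property is to split the question-answer pairs once, up front, into a development set you iterate on and a held-out set you score only rarely. The snippet below is a minimal sketch of that idea, not a prescribed method: the function name `split_eval_set`, the 30% hold-out fraction, the seed, and the sample pairs are all assumptions made for illustration.

```python
import random


def split_eval_set(qa_pairs, holdout_fraction=0.3, seed=42):
    """Split question-answer pairs once into (development, held-out) lists.

    The fixed seed keeps the split stable across runs, so the held-out
    pairs never drift into the set used for day-to-day tuning.
    """
    if not 0 < holdout_fraction < 1:
        raise ValueError("holdout_fraction must be strictly between 0 and 1")
    shuffled = list(qa_pairs)  # copy, so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, round(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


# Hypothetical question-answer pairs, for illustration only.
qa_pairs = [(f"question {i}", f"answer {i}") for i in range(10)]
dev_set, holdout_set = split_eval_set(qa_pairs)
print(len(dev_set), len(holdout_set))
```

Under this sketch, prompt and retrieval changes are judged against `dev_set` only, and `holdout_set` is scored when you want a number you can actually trust.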

But doing this properly is easier said than done. In practice, running the evaluation process properly can be genuinely exhausting. This is especially true for RAG apps, where the evaluation set is a set of question-answer pairs rather than a historical dataset: doing it the right way can be very tiring and time-consuming. Nonetheless, failing to run the evaluations properly results in a very familiar ML issue: overfitting.

What about overfitting?

Let’s take a step back and make a little detour into ML basics. In machine learning, a model is built using data that is typically split into a training set, a validation set, and a test set. More specifically, the model is first fit on the training set, which is the data that indicates what kind of model we need and that we use to adjust the model’s parameters accordingly.

In its simplest form, the training set consists of x and y pairs of data, and our goal is to come up with a y = f(x) model that optimally fits the available x and y data. Once that is done, the trained model is used to predict outcomes on the validation set. In particular, for each x in the validation set, we generate a prediction f(x) from the selected model, check how it compares with the actual y of the validation set, and then adjust our model accordingly.

At the very end, after having decided in the validation step which model we ultimately want to proceed with, we also run it on the test set. The goal of the test set is to see how well the final model generalises to data it has never seen before by calculating its scores, and this is why the test set should only be used once. We do all this because our goal isn’t to fit the training set, but rather to fit what the training set represents.

In this way, we can create models that learn the underlying patterns well enough to make accurate predictions on new, unseen data (the test set). Unfortunately, sometimes we fail to do so, and instead of creating models that fit the general case, we create models that just fit a narrow training set without generalising. This is what we call overfitting. As a result, the model performs exceptionally well on the training set, achieving impressive scores, but poorly on anything new.
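To make this concrete, here is a small sketch of the train / validation / test workflow on synthetic data. Everything in it is an assumption chosen for illustration: the target function sin(3x), the noise level, the set sizes, and the three polynomial degrees. It fits each candidate on the training set, picks one on the validation set, and scores the test set exactly once.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable


def make_data(n):
    """Draw n noisy samples of a hypothetical target function, sin(3x)."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + rng.normal(0.0, 0.2, size=n)
    return x, y


x_train, y_train = make_data(15)   # small on purpose: easy to memorize
x_val, y_val = make_data(200)
x_test, y_test = make_data(200)


def mse(model, x, y):
    """Mean squared error of a fitted polynomial on the pairs (x, y)."""
    return float(np.mean((model(x) - y) ** 2))


results = {}
for degree in (1, 3, 12):
    model = Polynomial.fit(x_train, y_train, degree)  # fit on training data only
    results[degree] = {
        "model": model,
        "train": mse(model, x_train, y_train),
        "val": mse(model, x_val, y_val),
    }
    print(f"degree {degree:2d}: train MSE {results[degree]['train']:.4f}, "
          f"validation MSE {results[degree]['val']:.4f}")

# Model selection uses the validation set; the test set is scored exactly once.
best_degree = min(results, key=lambda d: results[d]["val"])
test_mse = mse(results[best_degree]["model"], x_test, y_test)
print(f"selected degree {best_degree}, test MSE {test_mse:.4f}")
```

The training error can only go down as the degree grows, because each larger polynomial contains the smaller ones. The validation error is under no such obligation, and the gap between the two for the highest degree is the signature of overfitting.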

The catch here is that the test set is meaningful only if the model has genuinely never seen it before. The moment you use it to make a decision about the model, even an apparently small one, you have compromised it and essentially merged it with the training set. But after this little detour into ML basics, let’s get back to our original water cooler opinion.

Overfitting in RAG evaluation

This is where things get particularly relevant for those of us building and evaluating AI applications. In my series on evaluating RAG pipelines, we talked a lot about retrieval metrics: Precision@k, Recall@k, MRR, NDCG@k, and so on. Nevertheless, all those fancy metrics are only ever as useful as the evaluation set you apply them to.
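As a quick refresher, the sketch below shows one common way to compute these metrics for a single query, assuming binary relevance (a document is either relevant or not) and a ranked list without duplicates. The function names and the document IDs are illustrative assumptions, not taken from any particular library.

```python
import math


def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(doc in relevant for doc in retrieved[:k]) / k


def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(doc in relevant for doc in retrieved[:k]) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved.

    MRR is the mean of this value over all queries in the evaluation set.
    """
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved, relevant, k):
    """NDCG@k with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(retrieved[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical ranking for one query; "d1" and "d2" are the relevant documents.
retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(precision_at_k(retrieved, relevant, k=3))
print(recall_at_k(retrieved, relevant, k=3))
print(reciprocal_rank(retrieved, relevant))
print(ndcg_at_k(retrieved, relevant, k=3))
```

None of these functions knows whether the question behind `retrieved` was part of the set you tuned on, which is exactly why the numbers alone cannot tell you whether the system generalises.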

It turns out that the line between validation and test sets in RAG can blur surprisingly easily. I would attribute part of this to the fact that, unlike a simple regression model, AI models and RAG pipelines are far from intuitive to us. We have little real intuition for how the system is actually fitting to the data, and as a result, we may get carried away and tune it based on the test set without even realizing we did so.

The team in our water cooler story is doing exactly this. They identify issues during evaluation, fix them, and re-evaluate on the same question-answer pairs. Naturally, the evaluation scores improve in every iteration, because essentially they are now fitting the AI app to the test set. In particular, here are the most common ways this can happen in RAG:

Tuning prompts on the evaluation set: This is probably the most common pattern, and it is exactly what happened in our water cooler story. You run an evaluation, notice that certain question types consistently fail, and adjust your…