Larger Context Windows Don’t Fix RAG — So I Built a System That Does

I increased the context window five times. Something unexpected happened each time.

TL;DR: I built a dataset Q&A system and trusted a RAG answer whose numbers added up to less than half of the true total. I measured this across 7 query types and 5 context sizes on 100,000 rows. The fix: route computation queries away from RAG entirely.

I Trusted the Wrong Number

Last month I was heads-down building a new feature for EmiTechLogic. Learners can now upload their own messy CSV files and ask questions about their data in plain English. It sounded perfect for RAG, so I went all in: embeddings, retrieval, nice-looking responses.

The first few demos looked amazing. Clean tables, confident numbers, professional formatting. I actually started trusting the system in our internal testing.

Then I picked one number to double-check. Real grocery spend in the dataset: $1,140,033.24. The model gave me a beautiful breakdown by category. It looked legit. I added up the numbers it returned. The total came to less than half of the real figure.

I sat there staring at the screen thinking “this can’t be right.” So I did what any engineer would do. I increased the context window. 4k… 16k… 32k… 128k tokens. Each time the answer got longer, more detailed, and more confidently wrong.

That’s when it finally clicked. This wasn’t a retrieval issue. I was asking a retrieval system to perform heavy computation on data it had only partially seen. And instead of saying it was unsure or missing information, the model produced polished, structured answers that looked correct.

Why RAG Cannot Aggregate

The RAG pipeline doesn’t truly understand structured data. All it does is take each CSV row and flatten it into plain text. That’s it. To the model, a row looks something like this: “2019-01-01 grocery_pos 107.23 F NC Jennifer Banks …”

For a query like “What is the total spend by category?”, the RAG pipeline does this:

  1. Tokenise: [“total”, “spend”, “category”]
  2. Score all 100,000 rows by keyword overlap
  3. Return the top-N rows as serialised plain text
  4. Ask the LLM to sum and group from that text
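
The retrieval steps above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the stopword list, the function names, and the toy rows are all assumptions made for the example.

```python
import re

# Hypothetical stopword list; the article's tokeniser is not shown.
STOPWORDS = {"what", "is", "the", "by"}

def tokenise(text):
    """Lowercase, split into word tokens, and drop stopwords."""
    words = re.findall(r"[a-z0-9_]+", text.lower())
    return {word for word in words if word not in STOPWORDS}

def serialise(row):
    """Flatten one row into plain text, as a naive pipeline would."""
    return " ".join(str(value) for value in row.values())

def retrieve_top_n(query, rows, n):
    """Rank rows by keyword overlap with the query and keep the top n."""
    query_tokens = tokenise(query)
    ranked = sorted(
        rows,
        key=lambda row: len(query_tokens & tokenise(serialise(row))),
        reverse=True,
    )
    return ranked[:n]

# Toy data invented for illustration: amounts 10.0, 20.0, ..., 100.0.
rows = [
    {"date": "2019-01-01", "category": "grocery_pos", "amount": 10.0 * (i + 1)}
    for i in range(10)
]

query = "What is the total spend by category?"
query_tokens = tokenise(query)
top_rows = retrieve_top_n(query, rows, n=4)
context = "\n".join(serialise(row) for row in top_rows)  # what the LLM would see

# Best case: the LLM adds up the visible slice perfectly.
visible_total = sum(row["amount"] for row in top_rows)
true_total = sum(row["amount"] for row in rows)
```

In this toy data no row contains the words "total", "spend", or "category", so every row scores the same and the "top 4" is simply the first four rows. Whichever rows come back, the sum covers only that slice (100.0 here) rather than the full dataset (550.0).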

Step 4 is where the system fails. The LLM is not running a SUM. It is pattern-matching numbers from a text blob and generating a response that mimics an aggregation.

Models struggle with numerical precision at scale, but the real issue is the presentation. The model gives you a detailed breakdown across all categories. This is a classic trap. The output looks professional. It mimics the structure of a real report so well that your brain assumes the content is valid. Nothing in the answer tells you that 92% of your data is missing.

RAG is a retrieval tool. It is not a calculation engine. Retrieval finds relevant fragments. Computation requires a full dataset scan. When you use RAG for math, you get a wrong answer that looks authoritative. That distinction is critical. A visibly partial answer signals that data is missing. A complete-looking wrong answer only projects false confidence.
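
One way to act on that split is to decide, before retrieval, whether a query needs computation at all. The sketch below is a deliberately crude keyword router. The word list and the `route` function are hypothetical, and a production system would likely need something more robust than keyword matching.

```python
import re

# Hypothetical keyword list that marks a query as needing computation.
AGGREGATION_WORDS = {"total", "sum", "average", "avg", "mean", "count", "min", "max"}

def route(query):
    """Return "full_scan" for computation-style queries and "rag" otherwise."""
    tokens = set(re.findall(r"[a-z]+", query.lower()))
    return "full_scan" if tokens & AGGREGATION_WORDS else "rag"

aggregate_route = route("What is the total spend by category?")
lookup_route = route("Show me transactions by Jennifer Banks")
```

Here the first query is sent to a full scan because it contains "total", while the lookup query stays on the retrieval path.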

The Benchmark: Two Pipelines, Same Query

To measure this precisely, I built a benchmark that runs two pipelines side by side for every query. The first pipeline is a RAG simulation. It models what a naive vector pipeline passes to an LLM at five context sizes.

I tested five context sizes, ranging from 5 rows up to 8,000. In tokens, that scales from 325 to 500,000. For each size, I tracked three metrics: how much of the data the LLM sees, what sum it computes from that specific slice, and whether a reader could actually spot the error.
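
The coverage arithmetic behind those sizes is easy to check. The sketch below uses only the two endpoints stated above (5 rows at 325 tokens, and 8,000 rows) and assumes a constant token cost per row. The variable names are illustrative.

```python
TOTAL_ROWS = 100_000

# 5 rows cost 325 tokens, so one serialised row is about 65 tokens
# (assuming every row costs roughly the same).
tokens_per_row = 325 / 5

def missing_percent(rows_in_context, total_rows=TOTAL_ROWS):
    """Percentage of the dataset the LLM never sees."""
    return 100 * (total_rows - rows_in_context) / total_rows

largest_context_tokens = 8_000 * tokens_per_row
missing_at_largest = missing_percent(8_000)
missing_at_smallest = missing_percent(5)
```

Under that assumption, 8,000 rows come to 520,000 tokens, which suggests the 500,000 figure is rounded. Even at that largest size the model sees 8% of the rows, so 92% of the dataset is missing from the context.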

The second pipeline is a semantic engine that executes the same query as a deterministic full scan over all 100,000 rows and returns the exact answer.
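
A deterministic full scan for the "total spend by category" query can be sketched in a few lines of standard-library Python. This is not the semantic engine itself: the column names `category` and `amount`, the function name, and the sample CSV are assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict
from decimal import Decimal

def total_spend_by_category(csv_file, category_col="category", amount_col="amount"):
    """Read every row once and sum the amounts per category."""
    totals = defaultdict(Decimal)  # missing keys start at Decimal("0")
    for row in csv.DictReader(csv_file):
        # Decimal avoids floating-point drift when summing currency values.
        totals[row[category_col]] += Decimal(row[amount_col])
    return dict(totals)

# Tiny invented sample standing in for an uploaded CSV file.
sample_csv = io.StringIO(
    "category,amount\n"
    "grocery_pos,107.23\n"
    "grocery_pos,10.00\n"
    "gas_transport,50.50\n"
)
totals = total_spend_by_category(sample_csv)
```

Because every row is read, the result depends only on the data and not on which rows a retriever happened to surface.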

(Note: The original article includes a table and further technical breakdown of query types like SUM, AVG, and COUNT. The core takeaway is that RAG fails on these because it lacks the full dataset context required for accurate aggregation.)