Retrieval Is Filtering, Not Search: A Mental Model for Enterprise RAG

Enterprise Document Intelligence [Vol.1 #7A]: Stop searching strings. Filter line_df and toc_df. Pick anchors small, expand context large.

This article belongs to the retrieval brick of Enterprise Document Intelligence, a series that builds an enterprise RAG system from four bricks: parsing, question parsing, retrieval, and generation. Retrieval is the third brick, and this is the first of its three parts: the mental model. Retrieval is filtering, not search. Filter line_df and toc_df, pick small anchors, and expand to large context.

Watch how a human searches a document. Someone at work wants to know how many vacation days they get this year. They open the HR policy PDF, press Ctrl+F, and type “vacation”. Fifteen hits scroll past, some in headings, some in the TOC. They jump to the right paragraph, read the rule, and have their answer in 60 seconds. That’s not a novice doing it wrong. That’s a professional working in the most efficient way they know: they use keywords they know are in the document, rely on the TOC the author already wrote, and read a whole section when they suspect it’s the right one. Where in this process is “embedding similarity”? Nowhere.

Sometimes Ctrl+F finds nothing. The document calls it “PTO”, not “vacation”, or the text sits inside a scanned page that Ctrl+F can’t see. The expert tries a synonym, then another. Still zero hits. Then the expert opens the table of contents, scans the section titles, clicks the most likely one (“Leave and Time Off”), and reads the body. That fallback (keyword first, TOC navigation when the keyword fails) is what professional document work has run on for thirty years.
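The keyword-first, TOC-second fallback can be sketched in a few lines of pandas. This is only an illustration under assumed schemas: the column names (`line_id`, `section_id`, `text`, `title`), the sample rows, and the helpers `keyword_hits` and `retrieve` are hypothetical, not the series' actual implementation.

```python
import pandas as pd

# Hypothetical schemas and sample data, for illustration only.
line_df = pd.DataFrame({
    "line_id": [0, 1, 2, 3],
    "section_id": [1, 1, 2, 2],
    "text": [
        "Employees badge in at the front desk.",
        "Visitors must be escorted.",
        "Full-time staff accrue 20 days of PTO per year.",
        "Unused PTO carries over up to 5 days.",
    ],
})
toc_df = pd.DataFrame({
    "section_id": [1, 2],
    "title": ["Building Access", "Leave and Time Off"],
})

def keyword_hits(df, column, terms):
    """Rows whose `column` contains any of the terms (case-insensitive, literal)."""
    mask = pd.Series(False, index=df.index)
    for term in terms:
        mask |= df[column].str.contains(term, case=False, regex=False)
    return df[mask]

def retrieve(body_terms, title_terms):
    # Step 1: the Ctrl+F step, applied directly to the text.
    hits = keyword_hits(line_df, "text", body_terms)
    if not hits.empty:
        return "keyword", hits
    # Step 2: the keywords failed, so navigate by the map instead
    # and return the body of the matching sections.
    sections = keyword_hits(toc_df, "title", title_terms)
    body = line_df[line_df["section_id"].isin(sections["section_id"])]
    return "toc", body

route, rows = retrieve(["vacation", "holiday"], ["leave", "time off"])
print(route, rows["line_id"].tolist())  # toc [2, 3]
```

Neither "vacation" nor "holiday" appears in the sample text, so the call falls through to the title match on “Leave and Time Off” and returns that section's lines.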

This article (the first of three) builds the mental model behind that workflow: retrieval is a filtering problem on two structured tables (line_df and toc_df), not a search problem. It also introduces the distinction between anchor and context, that is, between where the match lands and what gets passed to generation. The other two articles build on that distinction.
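To make the anchor / context distinction concrete, here is a minimal sketch. It assumes a `line_df` with hypothetical `line_id`, `section_id`, and `text` columns, and it expands each anchor to its whole section, which is one possible expansion rule rather than the one the series necessarily uses.

```python
import pandas as pd

# Hypothetical schema and sample data, for illustration only.
line_df = pd.DataFrame({
    "line_id": range(6),
    "section_id": [1, 1, 2, 2, 2, 3],
    "text": [
        "Offices open at 8 am.",
        "Badges are required.",
        "Full-time staff accrue paid time off monthly.",
        "The annual allowance is 20 days.",
        "Requests need manager approval.",
        "Expenses are reimbursed within 30 days.",
    ],
})

def find_anchors(line_df, keyword):
    """Anchor: the individual lines where the match lands (small)."""
    return line_df[line_df["text"].str.contains(keyword, case=False, regex=False)]

def expand_to_section(line_df, anchors):
    """Context: every line of the sections that contain an anchor (large)."""
    return line_df[line_df["section_id"].isin(anchors["section_id"])]

anchors = find_anchors(line_df, "allowance")
context = expand_to_section(line_df, anchors)

print(anchors["line_id"].tolist())  # [3]
print(context["line_id"].tolist())  # [2, 3, 4]
```

The match lands on a single line, but the generator would receive the three lines of the surrounding section.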

The stance here is “amplify the expert”: codify the expert’s workflow, then do it better than they can manually. There are three concrete lifts. First, the expert types one keyword at a time, whereas the system can detect co-occurrence of multiple keywords on the same page or section in a single pass. Second, the expert sees nothing when words are locked in scanned images, whereas the parsing brick runs OCR at ingestion, so image-bound text becomes searchable like any other line. Third, the expert scans the TOC manually, whereas the system joins TOC and content programmatically: pick the right section from the map, then scope the keyword search inside that section’s body.
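The first lift, multi-keyword co-occurrence, could look like the following sketch. The `page` and `text` columns and the helper `pages_with_all` are assumptions made for illustration, and the same pattern would apply to a section column instead of a page column.

```python
import pandas as pd

# Hypothetical schema and sample data, for illustration only.
line_df = pd.DataFrame({
    "page": [1, 1, 2, 2, 3],
    "text": [
        "Vacation requests go through the HR portal.",
        "The cafeteria closes at 3 pm.",
        "Annual vacation accrual is 20 days.",
        "Carry-over of unused days is capped at 5.",
        "Carry-over rules for equipment budgets.",
    ],
})

def pages_with_all(line_df, keywords):
    """Pages where every keyword appears on at least one line."""
    # One boolean column per keyword, one row per line.
    flags = pd.DataFrame({
        kw: line_df["text"].str.contains(kw, case=False, regex=False)
        for kw in keywords
    })
    # Collapse lines to pages: did the keyword appear anywhere on the page?
    per_page = flags.groupby(line_df["page"]).any()
    # Keep only pages where all keywords appeared.
    return per_page[per_page.all(axis=1)].index.tolist()

print(pages_with_all(line_df, ["vacation", "carry-over"]))  # [2]
```

Pages 1 and 3 each contain only one of the two keywords, so only page 2 survives. A human doing this with Ctrl+F would have to cross-reference two separate hit lists by hand.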

Once the parsing brick has produced clean DataFrames, retrieval becomes a filtering problem on structured tables: filter line_df (the text) and toc_df (the map). This article builds up that mental model. The next two articles build the mechanics on top.

  1. Retrieval as filtering on structured tables

The standard framing of retrieval is “find the passages most similar to the query.” That framing is misleading because it imports the wrong mental model. Once parsing has produced clean DataFrames, retrieval is no longer a search problem in the classical sense. It is a filtering problem on structured tables. Every method we discuss is a different way of filtering rows of line_df (the document’s text) and toc_df (the document’s map). The mental model is closer to a SQL query than to a Google search.

This shift unlocks methods that don’t appear when you treat retrieval as free-text search. You can filter on columns: only lines in section X, only lines that match a regex, only chunks whose embedding is close to the query, only titles whose words intersect the question’s keywords. You can join tables: detect a keyword in line_df, look up the section that line belongs to in toc_df, then weight the score by whether the section title is also relevant. And you can run lightweight LLM filtering on the small tables (a toc_df with 50 entries) even though you can’t run it on the large ones (a line_df with 12,000 lines is too many tokens for any single call).
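The filter-then-join pattern can be sketched as follows. Everything here is an assumption made for illustration: the column names, the sample rows, the helper `score_hits`, and in particular the scoring rule (a base score of 1.0 plus a fixed bonus when the section title also contains the keyword), which stands in for whatever weighting a real system would use.

```python
import pandas as pd

# Hypothetical schemas and sample data, for illustration only.
line_df = pd.DataFrame({
    "line_id": [0, 1, 2, 3],
    "section_id": [1, 2, 2, 3],
    "text": [
        "Vacation photos may not be posted on the intranet.",
        "Vacation accrues at 1.67 days per month.",
        "Requests need manager approval.",
        "See the vacation policy for details.",
    ],
})
toc_df = pd.DataFrame({
    "section_id": [1, 2, 3],
    "title": ["Social Media Rules", "Vacation and Leave", "Onboarding Checklist"],
})

def score_hits(line_df, toc_df, keyword, title_bonus=1.0):
    # Filter: keep only lines that mention the keyword.
    hits = line_df[line_df["text"].str.contains(keyword, case=False, regex=False)]
    # Join: attach each hit's section title from the map.
    hits = hits.merge(toc_df, on="section_id", how="left")
    # Weight: base score 1.0, plus a bonus when the title also matches.
    title_match = hits["title"].str.contains(
        keyword, case=False, regex=False, na=False
    )
    hits["score"] = 1.0 + title_bonus * title_match.astype(float)
    return hits.sort_values("score", ascending=False, kind="mergesort")

ranked = score_hits(line_df, toc_df, "vacation")
print(ranked[["line_id", "title", "score"]])
```

All three keyword hits survive the filter, but only the line inside “Vacation and Leave” gets the title bonus, so it ranks above the stray mentions in unrelated sections. In SQL terms this is a WHERE clause, a JOIN, and an ORDER BY.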