The Untaught Lessons of RAG Retrieval: Cosine Is Not the Foundation
Enterprise Document Intelligence [Vol.1 #7ter] – Six positions on the retrieval brick that contradict the cosine-first reflex of mainstream RAG.
Kezhan Shi, Jul 3, 2026
This article is a manifesto companion to Enterprise Document Intelligence, the series whose philosophy is laid out in Amplify the Expert. It zooms in on brick 3 (retrieval) of the four-brick architecture and surfaces the lessons most tutorials skip.
The mainstream story describes retrieval as three steps: embed the question, return the top-k by cosine similarity, optionally rerank. We disagree with almost every part of it. Retrieval is filtering on structured tables, not searching free text. Embeddings are the optional fallback, not the foundation. Anchor and context are two granularities, not one. Each of these is a position we can defend, with consequences you can measure.
Lesson 1 – Retrieval is filtering, not searching
Once parsing is done, retrieval is a SQL-like filtering problem over line_df and toc_df, the reverse of the chunk-embed-cosine-top-k framing. The shift is simple to state: the question has columns, the document has columns, and retrieval is the join.
Why it matters. Search and filter are not synonyms; the two operations have different mechanics. Search scores every candidate on a continuous similarity (cosine, BM25), forces a top-k cutoff, and always returns something, even when the answer is not in the document. Filter applies a boolean condition (line.contains("X"), toc.title in [...]), retains every row that matches and no more, and can return zero rows when the document does not carry the answer.
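The contrast between the two operations can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the series' implementation: the `Line` schema and the toy `LINES` data stand in for rows of line_df, and the word-overlap score is a placeholder for cosine or BM25. The helper names `filter_lines` and `top_k_search` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Line:
    line_no: int
    section: str
    text: str


# Toy stand-in for rows of line_df; the schema is assumed for illustration.
LINES = [
    Line(1, "1. Coverage", "The policy covers water damage."),
    Line(2, "2. Premium", "The annual premium is 1,200 EUR."),
    Line(3, "3. Exclusions", "Flood damage is excluded."),
]


def filter_lines(lines, keyword):
    """Boolean filter: keep every line containing the keyword, and nothing else."""
    kw = keyword.lower()
    return [ln for ln in lines if kw in ln.text.lower()]


def top_k_search(lines, query, k=2):
    """Toy search: score by word overlap (placeholder for cosine or BM25), keep top k."""
    q = set(query.lower().split())

    def score(ln):
        words = set(ln.text.lower().rstrip(".").split())
        return len(q & words) / len(q | words)

    return sorted(lines, key=score, reverse=True)[:k]


premium_hits = filter_lines(LINES, "premium")        # the one matching line
deductible_hits = filter_lines(LINES, "deductible")  # empty: the document is silent
search_hits = top_k_search(LINES, "deductible amount")  # k lines, all scoring zero
```

The filter returns an empty list for "deductible", which downstream code can treat as "not in this document". The search still hands back k lines for the same question, and nothing in its output distinguishes them from a real match.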
Lesson 2 – Anchor and context, kept apart
You anchor on the single line that mentions “premium” (precise) but pass the whole surrounding section to generation (sufficient context); conflating them breaks precision and coverage in one move. Top-k forces you to pick: tiny chunks lose context, huge chunks lose precision. We get both, by keeping them apart.
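One way to picture the separation is to locate the anchor at line granularity, then widen to the enclosing section before generation. The sketch below assumes each line already carries the title of its section (as a parsed line_df joined to toc_df might); the `Line` schema, the sample data, and the helpers `find_anchors` and `expand_to_section` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Line:
    line_no: int
    section: str
    text: str


# Toy document; the section column is assumed to come from the parsed TOC.
LINES = [
    Line(1, "1. Coverage", "The policy covers water damage."),
    Line(2, "2. Premium", "Payment is due on 1 January each year."),
    Line(3, "2. Premium", "The annual premium is 1,200 EUR."),
    Line(4, "2. Premium", "Late payment incurs a 2% surcharge."),
    Line(5, "3. Exclusions", "Flood damage is excluded."),
]


def find_anchors(lines, keyword):
    """Anchor granularity: the exact lines whose text mentions the keyword."""
    kw = keyword.lower()
    return [ln for ln in lines if kw in ln.text.lower()]


def expand_to_section(lines, anchors):
    """Context granularity: every line of each section that holds an anchor."""
    sections = {a.section for a in anchors}
    return [ln for ln in lines if ln.section in sections]


anchors = find_anchors(LINES, "premium")      # precise: a single line
context = expand_to_section(LINES, anchors)   # sufficient: the whole section
```

Here the anchor is line 3 alone, while the context passed to generation is lines 2 to 4, so the due date and the surcharge travel with the premium amount without loosening the match itself.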
Lesson 3 – Embeddings come last, not first
Keywords always run (cheap, deterministic); the document’s own TOC is a first-class retrieval method; embeddings are the optional final signal, only when vocabulary mismatch is expected. The 2024-era reflex starts with embeddings; we leave them for the cases where the cheaper signals failed.
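The ordering of signals might look like the cascade below. It is a sketch under assumptions: rows are plain dictionaries standing in for line_df, the function names are hypothetical, and `embedding_fallback` is an injected callable that a caller supplies only when vocabulary mismatch is expected. The `fake_embedding_fallback` stub is a placeholder, not a real embedding model.

```python
# Toy rows standing in for line_df; the keys are assumed for illustration.
LINES = [
    {"line_no": 1, "section": "1. Coverage", "text": "The policy covers water damage."},
    {"line_no": 2, "section": "2. Premium", "text": "The annual premium is 1,200 EUR."},
    {"line_no": 3, "section": "3. Exclusions", "text": "Flood damage is excluded."},
]


def keyword_hits(lines, keywords):
    """Cheap, deterministic signal: substring match on the line text."""
    kws = [k.lower() for k in keywords]
    return [ln for ln in lines if any(k in ln["text"].lower() for k in kws)]


def toc_hits(lines, title_terms):
    """TOC signal: lines whose section title mentions one of the terms."""
    terms = [t.lower() for t in title_terms]
    return [ln for ln in lines if any(t in ln["section"].lower() for t in terms)]


def retrieve(lines, keywords, title_terms, embedding_fallback=None):
    """Run keywords and TOC always; call embeddings only if both found nothing."""
    signals = ["keywords", "toc"]
    hits = {ln["line_no"]: ln for ln in keyword_hits(lines, keywords)}
    for ln in toc_hits(lines, title_terms):
        hits.setdefault(ln["line_no"], ln)
    if not hits and embedding_fallback is not None:
        signals.append("embeddings")
        for ln in embedding_fallback(lines):
            hits.setdefault(ln["line_no"], ln)
    return [hits[n] for n in sorted(hits)], signals


def fake_embedding_fallback(lines):
    """Placeholder for a real embedding lookup; always returns the premium line."""
    return [lines[1]]


direct, direct_signals = retrieve(LINES, ["premium"], ["premium"])
mismatch, mismatch_signals = retrieve(
    LINES, ["cost"], ["cost"], embedding_fallback=fake_embedding_fallback
)
```

In the first call the cheap signals suffice and the fallback is never invoked. In the second, the question says "cost" while the document says "premium", so both cheap signals come back empty and only then does the embedding step run.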
Lesson 4 – Keywords prove absence; embeddings cannot
Zero hits on keyword search means the answer is genuinely not there; a low embedding similarity could mean absence or just different words, so embeddings are a refinement, not a decision gate. This asymmetry is the case for keywords as the primary signal in enterprise RAG.
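The gate-versus-refinement split can be sketched as follows. The example assumes the keyword list already covers the vocabulary the document uses for the topic; `decide`, `keyword_hits`, and the `shortest_first` reranker are hypothetical names, and the reranker is a trivial placeholder for an embedding-based ordering.

```python
LINES = [
    "The policy covers water damage.",
    "The annual premium is 1,200 EUR.",
    "Premium payments are due on 1 January.",
    "Flood damage is excluded.",
]


def keyword_hits(lines, keywords):
    """Deterministic substring match; an empty result is a usable signal."""
    kws = [k.lower() for k in keywords]
    return [ln for ln in lines if any(k in ln.lower() for k in kws)]


def decide(lines, keywords, rerank=None):
    """Keywords decide presence or absence; the reranker may only reorder hits."""
    hits = keyword_hits(lines, keywords)
    if not hits:
        # Gate: no keyword hit, so report absence instead of guessing.
        return {"status": "absent", "lines": []}
    if rerank is not None:
        ordered = rerank(hits)
        if sorted(ordered) != sorted(hits):
            raise ValueError("rerank must only reorder the keyword hits")
        hits = ordered
    return {"status": "found", "lines": hits}


def shortest_first(hits):
    """Placeholder reranker; a real one might order by embedding similarity."""
    return sorted(hits, key=len)


absent = decide(LINES, ["deductible"], rerank=shortest_first)
found = decide(LINES, ["premium"], rerank=shortest_first)
```

The reranker never sees the "deductible" question at all, because the keyword gate has already returned an absence. For "premium" it can change the order of the two hits but cannot add or drop a line.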
Lesson 5 – Co-occurrence beats BM25 on narrow corpora
BM25 ranks by term frequency, but the enterprise answer shape is one mention of a topic next to a specific value, so co-occurrence…