Rerankers Aren’t Magic Either: When the Cross-Encoder Layer Is Worth the Cost

Enterprise Document Intelligence [Vol. 1 #2bis]

Why stacking a reranker on top of weak retrieval doesn’t save it, what cross-encoders actually fix and what they don’t, and where the editorial position of the series lands.


Same setup as the embeddings article. Two scenes.

Scene 1. A team building a RAG system over a few hundred contracts has read Article 2. They know that embeddings break on negation, on exact identifiers, and on the gap between a question and its answer. The team’s first reflex is the one the literature suggests: add a reranker. A cross-encoder is smaller than an LLM and smarter than cosine similarity, so it slots in between the embeddings and the LLM. They wire in bge-reranker-base, send it the top-100 from the embedding stage, and keep the top-10. A few queries that were broken yesterday seem to work today. The team is encouraged.
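The wiring in Scene 1 can be sketched in a few lines. This is a minimal illustration, not the team's actual code: `rerank`, `PairScorer`, and `toy_overlap_scorer` are hypothetical names, and the toy scorer is a word-overlap stand-in so the example runs without a model. In a real system the scorer would wrap a cross-encoder such as bge-reranker-base, and `candidates` would be the top-100 passages from the embedding stage.

```python
from typing import Callable, List, Tuple

# Hypothetical scorer type: (query, passage) -> relevance score.
# A real deployment would wrap a cross-encoder model behind this signature.
PairScorer = Callable[[str, str], float]


def rerank(query: str, candidates: List[str], score_pair: PairScorer,
           keep: int = 10) -> List[Tuple[str, float]]:
    """Score every (query, candidate) pair and return the `keep` best."""
    scored = [(passage, score_pair(query, passage)) for passage in candidates]
    # Python's sort is stable, so ties keep their embedding-stage order.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:keep]


def toy_overlap_scorer(query: str, passage: str) -> float:
    """Stand-in scorer (NOT a cross-encoder): share of query words found."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    return len(query_words & passage_words) / max(len(query_words), 1)


candidates = [
    "termination clause for employees",
    "payment schedule",
    "termination notice period clause",
    "governing law",
]
top = rerank("termination clause", candidates, toy_overlap_scorer, keep=2)
```

Note that `rerank` only reorders and truncates what it is given. A passage the embedding stage failed to retrieve never reaches the scorer, which is why a reranker cannot rescue weak retrieval.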

Scene 2. Two weeks in, the same operational pattern from Article 2 returns. The user asks “list every clause that mentions termination” and the system returns the three “most relevant” ones, exactly three, ranked. The contract has eleven. The user asks “what’s the cancellation rule for non-employees?” The reranker has never seen the company’s term “non-employee labor,” and ranks an unrelated paragraph on top. The user asks “is there a clause that does NOT mention indemnification?” Same negation failure as before; the cross-encoder doesn’t see logical complementation any more than the embedding did. Latency, meanwhile, is now in the hundreds of milliseconds. The cross-encoder runs at query time on every candidate, and there’s no way to precompute it.
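The “list every clause” failure is arithmetic before it is a model problem. A ranked list cut to a fixed size caps recall regardless of how good the ranking is. A small sketch, using a hypothetical helper name and the numbers from the scene (three clauses returned, eleven in the contract):

```python
def recall_ceiling(returned: int, relevant_total: int) -> float:
    """Best achievable recall when only `returned` items come back."""
    if relevant_total <= 0:
        raise ValueError("relevant_total must be positive")
    return min(returned, relevant_total) / relevant_total


# Three clauses returned out of eleven relevant ones.
print(round(recall_ceiling(3, 11), 2))  # 0.27
```

Even a perfect reranker that puts three genuine termination clauses on top surfaces at most 3 of 11. Exhaustive questions need a different retrieval contract, not a better score.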

Worse: when they compare text-embedding-3-large with no reranker against ada-002 + bge-reranker-base, the embedding alone often matches or beats the two-stage stack.

The classical retrieval funnel looks the same as it did in Article 2. Cheap embedding similarity at the bottom narrows millions of candidates to thousands. An optional cross-encoder reranker in between narrows the thousands to dozens. The chat-completion LLM on top reads the dozens. The reranker is the layer that sits between two large constants on the cost-and-quality ladder. Knowing what each stage really does is what makes the funnel work; expecting magic from any single stage is how teams lose six months.


1. What a reranker actually is

Before the empirical tests, the architectural picture. Two reasons it matters: the reranker is a real engineering object with real costs, and the editorial position the series defends only makes sense once the classical role is on the table.

1.1 The cost/precision gradient

Three stages, ordered by cost per query:

  • Bi-encoder embedding similarity. A precomputed vector per document. At query time the model encodes the query once and runs cosine similarity against the index. Milliseconds for millions of candidates. Cheap and approximate.

  • Cross-encoder reranker. Query and passage are tokenised together and passed through a transformer that attends across both. The output is a single relevance score per pair. Cannot be precomputed because the query is part of the input. Tens of milliseconds per pair. Mid-cost, mid-precision.

  • Chat-completion LLM. Reads a small candidate set and produces a structured answer. Hundreds of milliseconds, dollars per million tokens. Most expensive, most accurate.
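The difference between the first two stages is easiest to see by counting model calls at query time. The sketch below is an illustration under stated assumptions: `toy_embed`, `PrecomputedIndex`, and `PairwiseScorer` are hypothetical names, and both "models" are toy functions (a hashed bag-of-words and a word-overlap count) standing in for a real bi-encoder and cross-encoder. Only the call counts are the point.

```python
import math
from typing import List


def toy_embed(text: str, dims: int = 16) -> List[float]:
    """Toy stand-in for a bi-encoder: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * dims
    for word in text.lower().split():
        vec[sum(ord(ch) for ch in word) % dims] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


class PrecomputedIndex:
    """Bi-encoder side: document vectors are built once, before any query."""

    def __init__(self, documents: List[str]) -> None:
        self.documents = documents
        self.vectors = [toy_embed(doc) for doc in documents]  # offline work
        self.query_time_encodes = 0

    def top_k(self, query: str, k: int) -> List[int]:
        self.query_time_encodes += 1  # only the query is encoded now
        q = toy_embed(query)
        sims = [sum(a * b for a, b in zip(q, vec)) for vec in self.vectors]
        ranked = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
        return ranked[:k]


class PairwiseScorer:
    """Cross-encoder side: one joint pass per (query, passage) pair."""

    def __init__(self) -> None:
        self.query_time_passes = 0

    def score(self, query: str, passage: str) -> float:
        self.query_time_passes += 1  # needs the query, so nothing is cached
        shared = set(query.lower().split()) & set(passage.lower().split())
        return float(len(shared))


docs = [f"clause {i} about topic {i % 7}" for i in range(1000)]
index = PrecomputedIndex(docs)
scorer = PairwiseScorer()

query = "clause about termination"
shortlist = index.top_k(query, k=100)
scores = [scorer.score(query, docs[i]) for i in shortlist]

print(index.query_time_encodes, scorer.query_time_passes)  # 1 100
```

One encode covers the whole index because the document vectors already exist; the pairwise scorer pays once per candidate, so its cost grows with the size of the shortlist it is handed.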

Each stage is justified by what it can do more cheaply than the stage above it. Embeddings can’t do everything an LLM can, but they can score a million candidates in the time the LLM reads ten. Rerankers can’t do everything an LLM can, but they can…