Query Rewriting Before Retrieval: The Cheap Recall Win Most Skip

A user types “how do I cancel” into your support bot. Your retriever embeds those four words, runs a cosine search, and hands the model five chunks: some about cancelling a payment, some about cancelling a meeting invite, and one where the word “cancel” appears in a changelog. The chunk about cancelling a subscription, the thing the user actually wanted, ranked seventh. The model answers from what it got. The answer is wrong, and nobody on your team can see why.

The query was the problem. It was too short, too ambiguous, and missing the context the corpus was written with. Most teams reach for a bigger embedding model or a reranker to fix this. Both help. Both are also more expensive than fixing the query before it ever hits the index, which is what query rewriting does.

Query rewriting sits in front of retrieval. You take the raw user query, run it through one cheap LLM call, and search with something better. Two patterns carry most of the win: multi-query expansion and step-back rewriting. Here is how each works, what it costs, and how to decide whether the latency is worth it.

Why the raw query is a bad search key

Embedding the user’s literal words assumes the user phrased the question the way the document phrased the answer. Users almost never do. They write short, lexically thin queries: “covid policy”, “refund”, “how do I cancel”. Your documents were written by someone else, months earlier, with different vocabulary and full context. The embedding of “how do I cancel” lands in a neighborhood crowded with every cancellation in the corpus. The single relevant chunk is in there, but it is not at the top, and your top-k cut throws it away.

Rewriting fixes the mismatch on the query side, where it is cheap, instead of the index side, where it is not. You are reshaping the search key so it lands closer to the answer.

Multi-query expansion

The idea: one query is one shot at the index. Generate several phrasings of the same intent, search with each, then merge the results. More phrasings mean more lexical and semantic surface area, which means a higher chance that one of them lands near the right chunk. You ask the LLM for a handful of variants, run each as its own search, and fuse the rankings with reciprocal rank fusion (RRF). RRF rewards documents that several variants agree on, so a chunk that shows up in three of four searches floats to the top even if no single search ranked it first.

(Code omitted for brevity)
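To make the fusion step concrete, here is a minimal sketch of multi-query expansion with RRF. It is an illustration under stated assumptions, not the original listing: `generate_variants` and `search` are hypothetical callables standing in for your LLM call and your retriever, document ids are plain strings, and `k=60` is a commonly used RRF constant rather than a value taken from the text.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed, commonly used default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id so output is stable.
    return sorted(scores, key=lambda d: (-scores[d], d))


def multi_query_retrieve(query, generate_variants, search, n_variants=3, top_k=5):
    """Search with the original query plus LLM-written variants, then fuse.

    generate_variants(query, n) -> list[str]   (hypothetical LLM wrapper)
    search(query, top_k) -> list[str]          (hypothetical retriever)
    """
    variants = generate_variants(query, n_variants)
    # The original query always stays in the set as the anchor.
    queries = [query] + [v for v in variants if v != query]
    rankings = [search(q, top_k) for q in queries]
    return reciprocal_rank_fusion(rankings)[:top_k]


if __name__ == "__main__":
    # Toy rankings: the subscription chunk is never first, but three of the
    # four searches agree on it.
    rankings = [
        ["cancel_payment", "cancel_subscription", "changelog"],
        ["cancel_meeting", "cancel_subscription", "invoice_faq"],
        ["billing_faq", "cancel_subscription", "cancel_meeting"],
        ["changelog", "cancel_payment", "cancel_meeting"],
    ]
    print(reciprocal_rank_fusion(rankings)[0])  # cancel_subscription
```

In the toy data, no single search puts `cancel_subscription` first, yet agreement across three lists is enough for it to win the fused ranking.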

Two details decide whether this helps or just burns tokens. First, always keep the original query in the set; the LLM sometimes drifts, and the original is your anchor. Second, use RRF, not a union-and-dedupe. A plain union throws away rank order, which is the only signal that tells you which agreed-upon document to trust. Multi-query earns its keep on vague and concept-style queries, the ones where the user’s intent is fuzzy and a few rephrasings explore it. The variants run in parallel, so the latency cost is one LLM call plus the slowest of N searches, not N searches end to end.
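The latency claim depends on actually issuing the searches concurrently. The sketch below shows one way to do that with a thread pool, assuming the retriever is I/O-bound (for example, a network call to a vector store). `search_all` and `slow_search` are hypothetical names used only for illustration, and the 0.2-second sleep is a stand-in for real search latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def search_all(queries, search, top_k=5):
    """Run one search per query concurrently.

    Returns one ranked list per query, in the same order as `queries`.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(queries))) as pool:
        # pool.map preserves input order even though calls overlap in time.
        return list(pool.map(lambda q: search(q, top_k), queries))


def slow_search(query, top_k):
    """Stand-in for a network-bound vector search (hypothetical)."""
    time.sleep(0.2)
    return [f"{query}-hit-{i}" for i in range(top_k)]


if __name__ == "__main__":
    start = time.perf_counter()
    rankings = search_all(["a", "b", "c", "d"], slow_search, top_k=2)
    elapsed = time.perf_counter() - start
    # Four 0.2 s searches should take roughly as long as one, not 0.8 s.
    print(f"{len(rankings)} searches in {elapsed:.2f}s")
```

The returned lists can be passed straight to whatever fusion step you use. If your retriever client is already async, an `asyncio.gather` over the same queries would serve the same purpose.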

Step-back rewriting

Step-back goes the other direction. Instead of more variants of the same question, you ask the LLM to back up one level of abstraction and pose the broader question first. The technique comes from a 2023 Google DeepMind paper, “Take a Step Back”, which found that prompting a model to reason about a general principle before the specific question improved its answers on reasoning benchmarks. The same move helps retrieval. A narrow query like “what is the late-payment penalty for tier 3 enterprise accounts” is so specific that the matching chunk has to contain almost those exact words. The step-back version, “how does billing handle late payments across account tiers”, retrieves the policy section that actually defines the penalty in context.

(Code omitted for brevity)
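As a rough illustration of the pattern, the sketch below rewrites the query with one LLM call and then searches with both the specific and the broader question. Everything here is an assumption made for the example: `llm` and `search` are hypothetical callables, the prompt wording is invented, and the rank-preserving interleave is just one simple way to merge two result lists.

```python
from itertools import zip_longest

# Hypothetical prompt; tune the wording for your own corpus and model.
STEP_BACK_PROMPT = (
    "Rewrite the question below as one broader question about the general "
    "policy or principle behind it. Return only the rewritten question.\n\n"
    "Question: {question}"
)


def retrieve(query, llm, search, top_k=5):
    """Search with the original query and its step-back rewrite.

    llm(prompt) -> str                 (hypothetical LLM wrapper)
    search(query, top_k) -> list[str]  (hypothetical retriever)
    """
    broader = llm(STEP_BACK_PROMPT.format(question=query)).strip()
    specific_hits = search(query, top_k)
    broad_hits = search(broader, top_k)

    # Interleave the two lists so each keeps its rank order, skipping
    # documents that were already taken from the other list.
    merged, seen = [], set()
    for pair in zip_longest(specific_hits, broad_hits):
        for doc_id in pair:
            if doc_id is not None and doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged[:top_k]
```

Keeping the original query in the mix guards against a rewrite that drifts too far from what the user asked, while the broader query pulls in the surrounding policy text.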

Notice that the retrieve function searches with both the original query and the step-back query.