Context Compression Before the LLM: Cutting Tokens Without Cutting Recall
You retrieve the top 10 chunks, paste them into the prompt, and send it to the model. Each chunk is 400 tokens. That is 4,000 tokens of context for a question whose answer lives in two sentences buried in chunk 6. You pay for all 4,000 on input. You also pay a quieter tax: the model has to find the answer inside a wall of near-miss text, and longer contexts can degrade answer quality even when the right fact is present.
The “Lost in the Middle” study showed this clearly. As input context grows, models use information at the start and end of the prompt most reliably and often lose track of facts stuck in the middle (Liu et al., 2023). So the chunk that ranked sixth, sitting in the middle of your prompt, is exactly where the model is weakest.
Context compression is the layer that sits between retrieval and generation. You retrieve generously, then squeeze the retrieved set down to the part that earns its place in the prompt. Two families do this: extractive and abstractive. They make different trade-offs, and most teams pick the wrong one for their data.
Extractive: keep the original sentences, drop the rest
Extractive compression scores each unit of retrieved text against the query and keeps only the units that clear a bar. The text you keep is verbatim from the source. Nothing is rewritten, so nothing is hallucinated into the context.
The simplest version works at the sentence level. Split each chunk into sentences, embed each sentence and the query, then keep the sentences whose similarity to the query beats a threshold.
(Code omitted for brevity)
The keep ratio, the fraction of top-scoring sentences you retain instead of applying a fixed threshold, is your dial. At keep=0.5 you drop half the sentences and roughly halve token cost on the context, assuming sentences of similar length. The sentences you keep are the originals, so a citation that points back to the source still lines up word for word.
The risk with sentence-level filtering is reference breakage. A sentence like “It expires after 30 days” scores low against the query “what is the refund policy” because, read on its own, it shares no terms or obvious topic with the query, even though it carries a key part of the answer. You cut it, and the model loses the qualifier. The fix is to keep a small window of neighboring sentences around every retained sentence so dangling pronouns and follow-on clauses survive.
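A minimal sketch of the sentence-level filter with a keep ratio and a context window, under stated assumptions. The names `compress_extractive`, `split_sentences`, and `toy_embed` are hypothetical, and `toy_embed` is a bag-of-words stand-in so the example runs without a model; in practice you would pass a real embedding function.

```python
import math
import re
from collections import Counter


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def compress_extractive(chunk: str, query: str, keep: float = 0.5,
                        window: int = 1, embed=toy_embed) -> str:
    """Keep the top `keep` fraction of sentences plus `window` neighbors each side."""
    sentences = split_sentences(chunk)
    if not sentences:
        return ""
    q = embed(query)
    scores = [cosine(embed(s), q) for s in sentences]
    n_keep = max(1, math.ceil(keep * len(sentences)))
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n_keep]
    kept = set()
    for i in top:
        # Pull in neighbors so pronouns and follow-on clauses keep their referent.
        kept.update(range(max(0, i - window), min(len(sentences), i + window + 1)))
    # Emit in original order; every sentence is verbatim from the source.
    return " ".join(sentences[i] for i in sorted(kept))


chunk = ("Our office is in Berlin. The refund policy covers all hardware purchases. "
         "It expires after 30 days. Shipping is free in the EU.")
print(compress_extractive(chunk, "what is the refund policy", keep=0.25, window=1))
```

With `window=0` this example keeps only the policy sentence; with `window=1` the "It expires after 30 days." follow-on survives too. The window means the output can exceed the nominal keep ratio, so measure the real token cut rather than assuming it equals `1 - keep`.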
A trained extractor does better than cosine similarity. Models like bge-reranker score query-sentence relevance as a cross-encoder, reading both together instead of comparing two separate embeddings. That catches the “it expires after 30 days” case more often, because the reranker sees the query and the sentence in one forward pass. It costs more per sentence, so run it on the candidate set after a cheap embedding filter, not on every sentence in the corpus.
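The two-stage pattern can be sketched as follows, assuming both stages are passed in as plain scoring functions. `two_stage_select` and `overlap_score` are hypothetical names, and `overlap_score` is only a word-overlap stub. In a real pipeline the second stage would wrap a cross-encoder (for example, the sentence-transformers `CrossEncoder.predict` call over `(query, sentence)` pairs), which is an assumption here, not something the sketch loads.

```python
from typing import Callable, List, Sequence

# (query, sentences) -> one relevance score per sentence
Scorer = Callable[[str, Sequence[str]], List[float]]


def overlap_score(query: str, sentences: Sequence[str]) -> List[float]:
    """Hypothetical cheap first stage: count of shared lowercase words."""
    q = set(query.lower().split())
    return [float(len(q & set(s.lower().split()))) for s in sentences]


def two_stage_select(query: str, sentences: Sequence[str],
                     cheap_score: Scorer, rerank_score: Scorer,
                     prefilter_k: int = 20, final_k: int = 5) -> List[str]:
    """Cheap filter to prefilter_k candidates, then rerank down to final_k."""
    if not sentences:
        return []
    cheap = cheap_score(query, sentences)
    candidates = sorted(range(len(sentences)),
                        key=lambda i: cheap[i], reverse=True)[:prefilter_k]
    # The expensive scorer only ever sees the surviving candidates.
    reranked = rerank_score(query, [sentences[i] for i in candidates])
    best = sorted(zip(candidates, reranked),
                  key=lambda pair: pair[1], reverse=True)[:final_k]
    # Return in original document order so the kept text still reads naturally.
    return [sentences[i] for i in sorted(i for i, _ in best)]
```

The cost of the second stage then scales with `prefilter_k`, not with the number of sentences retrieved.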
Abstractive: rewrite the chunks into a tighter summary
Abstractive compression sends the retrieved chunks to a small, fast model and asks it to write a query-focused summary. The output is new text. That is the appeal and the danger.
(Code omitted for brevity)
The compression ratio here can be far higher than extractive. The model can fold five paragraphs that circle the same fact into one sentence. On verbose corpora (support transcripts, meeting notes, legal boilerplate) abstractive can cut tokens by 50-75%, where extractive is bounded nearer 30-50% by how many sentences you keep. Treat both as rough, corpus-dependent rules of thumb, not guaranteed figures.
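Because these ranges depend on the corpus, it helps to measure the cut on your own data. A small sketch, assuming the token cut is defined as one minus the ratio of compressed to original tokens. `count_tokens` is a hypothetical whitespace proxy; swap in the tokenizer of the model you actually call.

```python
def count_tokens(text: str) -> int:
    """Crude proxy: whitespace-separated words, not real model tokens."""
    return len(text.split())


def token_cut(original: str, compressed: str, count=count_tokens) -> float:
    """Fraction of tokens removed: 0.0 means no savings, 0.6 means 60% fewer."""
    before = count(original)
    if before == 0:
        return 0.0
    return 1.0 - count(compressed) / before


original = "one two three four five six seven eight nine ten"
compressed = "two five six nine"
print(f"{token_cut(original, compressed):.0%}")
```

Tracking this number alongside answer quality on a fixed question set shows whether a higher cut is costing you recall.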
The cost is twofold. First, you add an LLM call before the answer call, which adds latency and spend. Second, the summarizer can drop a qualifier or smooth two facts into one wrong fact. “The discount is 10% for orders over $500” can come back as “there is a 10% discount” once the threshold gets summarized away. The mitigations: temperature 0, an explicit instruction to copy numbers and dates verbatim, and a NO_RELEVANT_CONTEXT escape hatch so the model is allowed to return nothing instead of inventing a connection.
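A sketch of how those mitigations might be wired together, under stated assumptions. `complete` is a hypothetical wrapper around whatever LLM client you use, and it is expected to call the model at temperature 0. The prompt wording and the numeric check are illustrative additions, not part of the original listing. The check only flags numbers that appear in the summary but not in the source; it cannot detect a dropped qualifier like the missing $500 threshold.

```python
import re
from typing import Callable, List, Optional

NO_CONTEXT = "NO_RELEVANT_CONTEXT"


def build_prompt(query: str, chunks: List[str]) -> str:
    """Query-focused summarization prompt with the verbatim and escape-hatch rules."""
    context = "\n\n".join(chunks)
    return (
        "Summarize only the parts of the context that help answer the question.\n"
        "Copy all numbers, dates, and thresholds verbatim from the context.\n"
        f"If nothing in the context is relevant, reply exactly: {NO_CONTEXT}\n\n"
        f"Question: {query}\n\nContext:\n{context}\n\nSummary:"
    )


def numbers_in(text: str) -> set:
    """Numeric tokens such as 10%, 500, or 2024 (trailing punctuation excluded)."""
    return set(re.findall(r"\d+(?:[.,]\d+)*%?", text))


def compress_abstractive(query: str, chunks: List[str],
                         complete: Callable[[str], str]) -> Optional[str]:
    """Return a summary, None if nothing is relevant, or the raw source on a failed check."""
    source = "\n\n".join(chunks)
    summary = complete(build_prompt(query, chunks)).strip()
    if summary == NO_CONTEXT:
        return None  # escape hatch: the caller can skip or answer without context
    if numbers_in(summary) - numbers_in(source):
        # The summary contains a number the source never stated: distrust it
        # and fall back to the uncompressed text.
        return source
    return summary
```

Falling back to the uncompressed source on a failed check trades tokens for safety; retrying the summarizer or switching to extractive compression for that query are equally reasonable choices.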
The trade-off, stated plainly
Pick on three axes: faithfulness, ratio, and cost.
| | Extractive | Abstractive |
|---|---|---|
| Faithful to source | high (verbatim) | medium (rewritten) |
| Typical token cut | 30-50% | 50-75% |