Long Context vs. Short Context Model: When Does a Long Context Model Win?
Balancing context capability against cost, speed, and data
1. Introduction
1.1 The marketing claim, and the question it skips
Each new generation of encoder models comes with a bigger context window. BERT and MiniLM gave us 512 tokens. Then ModernBERT arrived and pushed that to 8,192, a 16× increase. This wasn’t just one team’s decision: the whole industry moved in the same direction, with the standard input limit for encoders and embedding models climbing from 512 to 8,192 tokens over just a few years, and it may climb higher still (Figure 1).
Figure 1: Max input window of representative encoders (blue) and embedding models (orange) by year. Both families converged on 8,192 tokens. Image by author
From Figure 1, you can see there are two related but distinct model families: encoders and embedding models. The long-context trend has reshaped both. An encoder (BERT, ModernBERT) is, in short, a tool that turns text into numbers that capture meaning. You can then fine-tune it with a small task head, such as a classification head, for your end task. An embedding model (sentence-transformers, nomic-embed, GTE/E5), on the other hand, turns text into numbers so you can compare or search. It takes an encoder one step further: it compresses an entire passage into a single fixed-length vector that you can compare against other vectors in semantic search and RAG retrieval.
Both encoder models and embedding models are built the same way under the hood, but they give you back something different. An encoder model gives you a separate representation for every single token in your input. That’s useful when you’re fine-tuning. An embedding model collapses all of that down into a single vector. That vector is built for comparison.
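To make the difference in output shape concrete, here is a minimal sketch in plain Python. It assumes mean pooling over non-padding tokens, which is one common way to collapse per-token vectors into a single vector; the article does not say which pooling the models above use, and the numbers are toy values rather than real model output. The helper name `mean_pool` is made up for illustration.

```python
from typing import List


def mean_pool(token_vectors: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the vectors of real (non-padding) tokens into one fixed-length vector."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask masks out every token")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy "encoder output": one 3-dimensional vector per token position.
token_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 4.0],
    [9.0, 9.0, 9.0],  # padding position, excluded by the mask below
]
attention_mask = [1, 1, 1, 0]

# Toy "embedding output": a single vector for the whole passage.
sentence_vector = mean_pool(token_vectors, attention_mask)
print(len(token_vectors), "token vectors ->", sentence_vector)
```

A fine-tuning head would typically consume `token_vectors` (or one designated position from it), while a retrieval system would store and compare only `sentence_vector`.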
Why is the context window getting longer? There’s a seductive idea floating around: “give the model more text, and it’ll understand more”. However, “we support 8,192 tokens” is an engineering spec, not a performance guarantee. A model can technically accept 8,192 tokens and still produce the same output it would have from just the first 512. Nobody really answers the awkward follow-up question: how much does that extra context actually help, and on what kinds of tasks? This article sets out to answer it on a small 32M-parameter model, the kind of model you’d actually use in production because it’s cheap and fast at scale. We ran controlled experiments where context length was the only thing we changed. Everything else stayed fixed.
1.2 Why this matters: the cost is quadratic
Transformer attention scales with the square of your sequence length, O(n²). Going from 512 to 8,192 tokens is 16× more input but roughly 256× more attention compute. In our experiments, we measured a 22× wall-clock increase in training time on a binary patent task (35 s → 771 s), and a 30× increase on a 9-way patent task (93 s → 2,769 s). So the question isn’t whether longer context helps. It usually does. The question is whether it helps enough. Seven accuracy points? Pay it. A fraction of a point that flips across random seeds? You just lit money on fire. Hence, the engineering decision this study is built to inform is: You have a long document. You have a fixed task. Should you pay the quadratic cost of an 8,192-token window, or will a cheap 512-token pass, or a simple chunking trick, get you close enough for a fraction of the price?
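The arithmetic behind those ratios is easy to check. The sketch below recomputes the 16× input ratio, the 256× attention ratio implied by O(n²), and the wall-clock slowdowns from the timings quoted above. The function name `attention_cost_ratio` is made up for illustration, and the quadratic model covers attention only, not the whole training step.

```python
def attention_cost_ratio(short_len: int, long_len: int) -> float:
    """Relative attention cost under a pure O(n^2) model of sequence length."""
    return (long_len / short_len) ** 2


short_len, long_len = 512, 8192
input_ratio = long_len / short_len                       # how much more text goes in
attn_ratio = attention_cost_ratio(short_len, long_len)   # how much more attention work

# Training times in seconds quoted in the text: (512-token run, 8,192-token run).
measured = {"binary": (35, 771), "9-way": (93, 2769)}
slowdowns = {task: long_s / short_s for task, (short_s, long_s) in measured.items()}

print(f"input: {input_ratio:.0f}x, attention (O(n^2)): {attn_ratio:.0f}x")
for task, ratio in slowdowns.items():
    print(f"{task}: measured {ratio:.1f}x wall-clock")
```

The measured slowdowns sit well below 256×. One plausible reading is that only part of each training step is quadratic in sequence length while the rest grows roughly linearly, but the timings alone do not establish that.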
1.3 The answer: it’s about where the signal lives, not how long the document is
The intuitive assumption is: longer document = more need for a long context window. That’s wrong. What matters isn’t document length; it’s where the useful information sits. Take a 5,000-token patent whose category is decided by the title, abstract, and first claim: a 512-token window already sees everything that matters (Figure 2). Extending the window out to token 4,000 adds nothing. But if the answer requires pieces scattered across the whole document, or only appears past token 512, that’s when a longer window actually earns its cost.
Figure 2: Three documents of identical length (8,192 tokens); only the signal’s position changes down the figure. When the signal is front-loaded, it sits inside the first 512 tokens, so a cheap pass already sees it and the long window adds ~0. When it sits past 512 or is scattered end-to-end, only the full window reaches it. Length is held constant; what moves the verdict is where the signal lives. Image by author
Document length and signal dispersion are two separate things, but they get treated as one. What the experiments actually show is uncomfortable: the long documents people classify in practice (patents, papers, legal filings) tend to front-load their key information. That means the expensive 8,192-token window is mostly re-reading what the cheap 512-token window already saw.
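A toy calculation makes the point about signal position. The sketch below takes three hypothetical documents of the same length, mirroring the three cases in Figure 2, and reports what fraction of the signal-bearing token positions a truncated window can see. The positions are invented for illustration, and the helper `coverage` is not part of the study’s code.

```python
from typing import Dict, List


def coverage(signal_positions: List[int], window: int) -> float:
    """Fraction of signal-bearing token positions inside the first `window` tokens."""
    if not signal_positions:
        return 0.0
    seen = sum(1 for pos in signal_positions if pos < window)
    return seen / len(signal_positions)


# Hypothetical signal positions in three 8,192-token documents.
docs: Dict[str, List[int]] = {
    "front_loaded": [10, 120, 300, 480],
    "late": [5000, 5100, 5200, 5300],
    "scattered": [100, 2000, 4500, 8000],
}

results = {
    name: (coverage(positions, 512), coverage(positions, 8192))
    for name, positions in docs.items()
}
for name, (short_cov, long_cov) in results.items():
    print(f"{name}: 512-token window sees {short_cov:.0%}, 8,192-token window sees {long_cov:.0%}")
```

All three documents have the same length, yet the 512-token window sees all, none, or a quarter of the signal depending only on where it sits. In a real task you rarely know these positions up front, so the coverage has to be estimated empirically, for instance by comparing accuracy at different truncation lengths.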
1.4 Who this is for, and what you’ll take away
Who. This is written for ML engineers and applied researchers who need to make a real decision about context length, whether that’s fine-tuning an encoder for long-document classification, building a RAG pipeline, or figuring out what inference costs look like when you’re serving a model at scale. You don’t need prior experience with long-context models. Part 2 explains all the techniques from scratch.
What you’ll walk away with:
- A simple decision rule. Instead of asking “how long is this document?”, you ask “where does the signal live?”. That question routes you to the right approach. It’s summarized as a decision tree you can apply directly to your own task.
- Actual cost numbers. What do 512 tokens vs. 8,192 tokens actually cost you in training time and inference time, on GPU and on CPU? Once you see the numbers, “just use the longer context window” stops being a default and becomes a choice you’re pricing consciously.
- Two cheaper techniques that often beat the long window. Chunk-and-pool.
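As a rough preview of the chunk-and-pool idea, here is a minimal sketch under stated assumptions: the document is split into non-overlapping 512-token chunks, each chunk is encoded separately, and the chunk vectors are averaged. The chunk size, the absence of overlap, the mean pooling, and the stand-in `toy_encode` function are all illustrative choices, not the configuration used in the experiments.

```python
from typing import Callable, List, Sequence


def chunk(tokens: Sequence[int], size: int = 512) -> List[Sequence[int]]:
    """Split a token sequence into consecutive, non-overlapping chunks."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def chunk_and_pool(
    tokens: Sequence[int],
    encode_chunk: Callable[[Sequence[int]], List[float]],
    size: int = 512,
) -> List[float]:
    """Encode each chunk on its own, then mean-pool the chunk vectors."""
    vectors = [encode_chunk(c) for c in chunk(tokens, size)]
    if not vectors:
        raise ValueError("cannot pool an empty document")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def toy_encode(chunk_tokens: Sequence[int]) -> List[float]:
    # Hypothetical stand-in for a real short-context encoder:
    # returns [chunk length, largest token id] as a 2-dimensional "vector".
    return [float(len(chunk_tokens)), float(max(chunk_tokens))]


tokens = list(range(1300))  # a fake 1,300-token document
chunks = chunk(tokens)
doc_vector = chunk_and_pool(tokens, toy_encode)
print([len(c) for c in chunks], doc_vector)
```

Under the O(n²) model from Section 1.2, sixteen separate 512-token passes cost about 16 × 512² attention units, against 8,192² for one full-window pass, which is roughly 16× less attention work. The trade-off is that no single pass sees tokens from two different chunks together.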