DeepSeek-V4: a million-token context that agents can actually use
DeepSeek released V4 today. Two MoE checkpoints are on the Hub: DeepSeek-V4-Pro at 1.6T total parameters with 49B active, and DeepSeek-V4-Flash at 284B total with 13B active. Both have a 1M-token context window. The benchmark numbers are competitive, but not SOTA. It doesn't matter. The real innovation is how DeepSeek-V4 is designed for efficient long-context support, which makes it one of the best candidates for agentic tasks.
The focus is long-running agentic workloads. Running a frontier open model as an agent today breaks in predictable ways. The model stops. You reprompt. The trace blows past the context budget, or the KV cache fills the GPU, or tool-call round trips degrade halfway through a long task. V4 is built to fix these known failures and to point the way for the community to follow. This post covers three things: what the architecture does differently to make long-context inference cheap, the agent-specific post-training decisions that compound on top of it, and some takeaways from the paper that help in reasoning about these changes.
The KV cache problem for agents
A 1M context window is just capacity, not performance. Whether you can use it depends on the cost of every forward pass at that depth. For an agent running a long tool-use trajectory (a SWE-bench task, a multi-step browse session, a terminal session with hundreds of commands), every tool result is appended to the context, and every subsequent token pays the full attention cost against everything that came before. Two numbers matter: single-token inference FLOPs and KV cache size. Both grow with sequence length.
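To make that growth concrete, here is a back-of-envelope sketch of KV cache size for a plain dense-attention stack; the layer, head, and dimension counts are illustrative placeholders, not DeepSeek-V4's actual configuration.

```python
# Back-of-envelope KV cache growth for a plain dense-attention stack serving
# one sequence. Layer/head/dim counts are illustrative placeholders only.
def kv_cache_bytes(seq_len, n_layers=61, n_kv_heads=8, head_dim=128,
                   bytes_per_elem=2):          # bfloat16 = 2 bytes
    # one K and one V vector per token, per KV head, per layer
    return seq_len * n_layers * n_kv_heads * head_dim * 2 * bytes_per_elem

for tokens in (8_000, 128_000, 1_000_000):
    print(f"{tokens:>9,} tokens -> {kv_cache_bytes(tokens) / 2**30:6.1f} GiB")
```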
At 1M tokens, DeepSeek-V4-Pro requires 27% of the single-token inference FLOPs of DeepSeek-V3.2, so it runs faster on the same hardware. It also uses 10% of the KV cache memory. V4-Flash drops these numbers even further: 10% of the FLOPs and 7% of the KV cache. Compared against an established architecture like grouped-query attention with 8 KV heads, stored in the usual bfloat16 format, DeepSeek-V4 needs roughly 2% of the cache size. This makes it much easier to deploy for very long contexts.
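Plugging the quoted ~2% ratio into that kind of baseline shows why this matters for deployment. Only the ratio comes from the paper; the baseline shape is the same illustrative assumption as above.

```python
# Apply the ~2% ratio quoted above to an 8-KV-head GQA baseline in bf16.
# Only the ratio is from the text; 61 layers / 8 heads / head_dim 128 are
# illustrative assumptions.
baseline_bytes = 1_000_000 * 61 * 8 * 128 * 2 * 2   # tokens * layers * heads * dim * (K+V) * bf16
baseline_gib = baseline_bytes / 2**30
print(f"GQA-8 bf16 baseline at 1M tokens: {baseline_gib:.0f} GiB")       # ~233 GiB
print(f"~2% of that:                      {0.02 * baseline_gib:.1f} GiB") # single-GPU territory
```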
Hybrid attention: CSA and HCA
The efficiency gain comes from splitting attention into two mechanisms and interleaving them across layers. Compressed Sparse Attention (CSA) compresses KV entries by 4x along the sequence dimension using softmax-gated pooling with a learned positional bias. A lightning indexer (FP4, ReLU-scored multi-head dot product) picks the top-k compressed blocks per query. It inherits the sparse-selection idea from DeepSeek Sparse Attention in V3.2, but runs it over blocks that are already 4x shorter than the original sequence. The indexer’s search space shrinks with it.
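A minimal sketch of the two CSA steps, assuming toy shapes and a single ReLU-scored head in place of the real FP4 multi-head indexer: softmax-gated pooling compresses the KV sequence 4x, then the indexer keeps only the top-k compressed blocks per query.

```python
import numpy as np

# Sketch of the CSA idea, not DeepSeek's implementation: compress the KV
# sequence 4x with softmax-gated pooling, then score the compressed blocks
# and keep only the top-k per query. Shapes, the gate, and the single
# scoring head are simplified assumptions.
def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def compress_kv(kv, gate_w, block=4):
    # kv: (seq, d); gate_w: (d,) learned gate; returns (seq // block, d)
    seq, d = kv.shape
    blocks = kv[: seq - seq % block].reshape(-1, block, d)
    gates = softmax(blocks @ gate_w, axis=1)          # (n_blocks, block)
    return (gates[..., None] * blocks).sum(axis=1)    # softmax-gated pooling

def topk_blocks(query, compressed_k, k=8):
    # Lightning-indexer stand-in: ReLU-scored dot products, keep top-k blocks.
    scores = np.maximum(compressed_k @ query, 0.0)
    return np.argsort(scores)[-k:]

rng = np.random.default_rng(0)
kv = rng.standard_normal((1024, 64))
ck = compress_kv(kv, rng.standard_normal(64))   # 1024 -> 256 compressed entries
keep = topk_blocks(rng.standard_normal(64), ck) # indices of 8 blocks to attend to
print(ck.shape, keep)
```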
Heavily Compressed Attention (HCA) compresses KV entries by 128x and drops the sparse selection. Every query attends densely to every compressed block. The compressed sequence is short enough that dense attention is cheap. The layers alternate between CSA and HCA. Different layers carry different attention patterns, and forcing one mechanism across all of them wastes capacity. In V4-Pro’s 61-layer stack, layers 0–1 are HCA, layers 2–60 alternate CSA and HCA, and the MTP block at the end runs sliding-window only. Both paths use FP8 storage for most KV entries and BF16 only for the RoPE dimensions. The lightning indexer inside CSA runs in FP4. These storage choices compound with the compression ratios to produce the 2% KV cache figure.
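A quick sketch of what the two ratios and the layer schedule look like at a 1M-token context. The phase of the alternation within layers 2–60 is an assumption; the paper only says the two mechanisms interleave.

```python
# What the compression ratios mean at a 1M-token context, plus the layer-wise
# schedule described above for V4-Pro. Whether layer 2 starts with CSA or HCA
# is a guess.
seq_len = 1_000_000
print("CSA entries (4x):  ", seq_len // 4)     # 250,000 blocks, then sparse top-k
print("HCA entries (128x):", seq_len // 128)   # 7,812 blocks, attended densely

def attention_kind(layer_idx):
    if layer_idx < 2:
        return "HCA"
    return "CSA" if (layer_idx - 2) % 2 == 0 else "HCA"

schedule = [attention_kind(i) for i in range(61)] + ["sliding-window (MTP)"]
print(schedule[:5], "...", schedule[-2:])
```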
What changes for agents
Efficient long-context attention is necessary for agent workflows but not sufficient. The paper describes three post-training and infrastructure choices that target agent use cases directly.
Interleaved thinking across tool calls: V3.2 kept reasoning traces across tool-result rounds but discarded them whenever a new user message arrived. For an agent handling a single user turn, this was fine. For multi-turn agentic workflows, where the user sends a follow-up after the agent has already chained several tool calls, the model lost its accumulated reasoning and had to reconstruct state. V4 preserves reasoning content across user message boundaries when the conversation contains tool calls. The model retains the complete reasoning history across all rounds, including across user turns. This allows a coherent, cumulative chain of thought over long-horizon agent tasks.
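The retention rule is easy to express over a simplified message list. The field names below are assumptions made for the sketch, not DeepSeek's actual chat-template keys.

```python
# Illustration of the context-retention rule described above, over a toy
# message history. "role"/"reasoning" keys are assumptions for the sketch.
def prune_reasoning_v32_style(messages):
    # Old behavior: drop earlier reasoning whenever a new user turn arrives.
    last_user = max(i for i, m in enumerate(messages) if m["role"] == "user")
    return [
        {k: v for k, v in m.items() if not (k == "reasoning" and i < last_user)}
        for i, m in enumerate(messages)
    ]

def prune_reasoning_v4_style(messages):
    # New behavior: if the conversation contains tool calls, keep the full
    # reasoning history across user-turn boundaries as well.
    has_tool_calls = any(m["role"] == "tool" for m in messages)
    return messages if has_tool_calls else prune_reasoning_v32_style(messages)

history = [
    {"role": "user", "content": "fix the failing test"},
    {"role": "assistant", "reasoning": "plan: run pytest, read trace", "content": ""},
    {"role": "tool", "content": "1 failed: test_parse"},
    {"role": "user", "content": "also update the changelog"},
]
kept = [m for m in prune_reasoning_v4_style(history) if "reasoning" in m]
print(len(kept))  # 1: the earlier reasoning survives the new user turn
```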
Tool-call schema with dedicated tokens: V4 introduces a |DSML| special token and an XML-based tool-call format. The XML format reduces escaping failures compared to JSON-in-string tool calls, a common failure mode when models emit nested quoted content. The schema separates string parameters (passed as-is with string="true") from structured parameters (passed as JSON with string="false"). This removes a class of parsing errors around numbers and booleans that JSON-in-string formats are prone to.
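For illustration, here is one way such a call could look and be parsed. The tag and attribute names are hypothetical; only the string="true" versus string="false" distinction is taken from the text above.

```python
# Hypothetical rendering of an XML-style tool call. Tag names are guesses;
# the point is that string params need no JSON escaping, while structured
# params keep real numbers and booleans.
import json
import xml.etree.ElementTree as ET

tool_call = """\
<tool_call name="run_shell">
  <parameter name="command" string="true">grep -rn "TODO" src/</parameter>
  <parameter name="options" string="false">{"timeout": 30, "capture_output": true}</parameter>
</tool_call>"""

root = ET.fromstring(tool_call)
args = {
    p.attrib["name"]: (p.text if p.attrib["string"] == "true" else json.loads(p.text))
    for p in root
}
print(args["options"]["timeout"] + 1)  # 31: a real int, not the string "30"
```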