DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence

Abstract: We present a preview version of the DeepSeek-V4 series, which includes two strong Mixture-of-Experts (MoE) language models: DeepSeek-V4-Pro with 1.6T parameters (49B activated) and DeepSeek-V4-Flash with 284B parameters (13B activated). Both models support a context length of one million tokens.
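
To make the sparsity of the two models concrete, the short sketch below computes the share of parameters that is activated, using only the parameter counts quoted in the abstract. The dictionary layout and the function name `activated_fraction` are illustrative choices, not part of any DeepSeek release.

```python
# Parameter counts as quoted in the abstract (T = 1e12, B = 1e9).
MODELS = {
    "DeepSeek-V4-Pro": {"total_params": 1.6e12, "activated_params": 49e9},
    "DeepSeek-V4-Flash": {"total_params": 284e9, "activated_params": 13e9},
}


def activated_fraction(total_params: float, activated_params: float) -> float:
    """Return the fraction of all parameters that is activated."""
    return activated_params / total_params


for name, counts in MODELS.items():
    fraction = activated_fraction(**counts)
    print(f"{name}: {fraction:.1%} of parameters activated")
# DeepSeek-V4-Pro: 3.1% of parameters activated
# DeepSeek-V4-Flash: 4.6% of parameters activated
```

Under these figures, the larger model is the sparser of the two: roughly 3% of its parameters are activated, against roughly 5% for DeepSeek-V4-Flash.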

The DeepSeek-V4 series incorporates several key upgrades in architecture and optimization: (1) a hybrid attention architecture that combines Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) to improve long-context efficiency; (2) Manifold-Constrained Hyper-Connections (mHC), which enhance conventional residual connections; and (3) the Muon optimizer, for faster convergence and greater training stability.
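
The abstract does not say how Muon is configured in DeepSeek-V4, so the following is only a minimal sketch of the publicly described Muon recipe: accumulate momentum for a 2D weight matrix, then approximately orthogonalize the momentum with a Newton-Schulz iteration before applying it. The iteration coefficients follow the public reference implementation of Muon; the learning rate, momentum value, step count, and function names are assumptions chosen for illustration and are not taken from the paper.

```python
import numpy as np


def newton_schulz_orthogonalize(G: np.ndarray, steps: int = 5, eps: float = 1e-7) -> np.ndarray:
    """Approximately map G to a matrix with the same singular vectors
    and singular values close to 1 (quintic Newton-Schulz iteration)."""
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients from the public Muon reference code
    X = G / (np.linalg.norm(G) + eps)  # Frobenius norm bounds every singular value by 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T  # iterate on the wide orientation so X @ X.T is the smaller product
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.T if transposed else X


def muon_step(W, grad, momentum_buf, lr=0.02, beta=0.95):
    """One illustrative Muon update for a 2D weight matrix (hypothetical hyperparameters)."""
    momentum_buf = beta * momentum_buf + grad
    update = newton_schulz_orthogonalize(momentum_buf)
    return W - lr * update, momentum_buf


# Usage on random data: the orthogonalized update has singular values near 1.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))
grad = rng.standard_normal((8, 16))
W_new, buf = muon_step(W, grad, np.zeros_like(W))
print(np.linalg.svd(newton_schulz_orthogonalize(grad), compute_uv=False))
```

The iteration does not converge to exactly orthogonal matrices; it only pushes the singular values into a band around 1, which the public Muon description treats as sufficient in practice.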

We pre-train both models on more than 32T diverse, high-quality tokens, followed by a comprehensive post-training pipeline that unlocks and further enhances their capabilities. DeepSeek-V4-Pro-Max, the maximum-reasoning-effort mode of DeepSeek-V4-Pro, sets a new state of the art among open models, outperforming its predecessors on core tasks.

Meanwhile, the DeepSeek-V4 series is highly efficient in long-context scenarios. In the one-million-token context setting, DeepSeek-V4-Pro requires only 27% of the single-token inference FLOPs and 10% of the KV cache of DeepSeek-V3.2. This enables us to routinely support one-million-token contexts, making long-horizon tasks and further test-time scaling more feasible.
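
The percentages above can be read as reduction factors relative to DeepSeek-V3.2. The sketch below does that conversion using only the two fractions quoted in the abstract; the constant and function names are illustrative.

```python
# Fractions of DeepSeek-V3.2's cost, as quoted for the one-million-token setting.
FLOPS_FRACTION = 0.27      # single-token inference FLOPs
KV_CACHE_FRACTION = 0.10   # KV cache size


def reduction_factor(fraction: float) -> float:
    """Convert 'requires only `fraction` of the baseline' into an 'N times less' factor."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return 1.0 / fraction


print(f"Inference FLOPs: {reduction_factor(FLOPS_FRACTION):.1f}x lower")
print(f"KV cache:        {reduction_factor(KV_CACHE_FRACTION):.1f}x smaller")
# Inference FLOPs: 3.7x lower
# KV cache:        10.0x smaller
```

These factors apply only to the one-million-token setting stated in the abstract; the abstract gives no figures for shorter contexts.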
