DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence
Abstract: We present a preview version of the DeepSeek-V4 series, which includes two strong Mixture-of-Experts (MoE) language models: DeepSeek-V4-Pro with 1.6T parameters (49B activated) and DeepSeek-V4-Flash with 284B parameters (13B activated). Both models support a context length of one million tokens.
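As a quick arithmetic check on the parameter counts above, the following minimal sketch computes the fraction of parameters that is activated in each model. The dictionary layout and the helper name `activated_fraction` are illustrative assumptions; only the four parameter counts come from the abstract.

```python
# Parameter counts as stated in the abstract, in billions.
# The dictionary layout and key names are illustrative assumptions.
MODELS = {
    "DeepSeek-V4-Pro": {"total_b": 1600.0, "activated_b": 49.0},
    "DeepSeek-V4-Flash": {"total_b": 284.0, "activated_b": 13.0},
}


def activated_fraction(total_b: float, activated_b: float) -> float:
    """Return activated parameters divided by total parameters."""
    if total_b <= 0:
        raise ValueError("total parameter count must be positive")
    return activated_b / total_b


if __name__ == "__main__":
    for name, params in MODELS.items():
        frac = activated_fraction(params["total_b"], params["activated_b"])
        print(f"{name}: {frac:.1%} of parameters activated")
```

Under these figures, DeepSeek-V4-Pro activates roughly 3.1% of its parameters and DeepSeek-V4-Flash roughly 4.6%.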
The DeepSeek-V4 series incorporates several key upgrades in architecture and optimization: (1) a hybrid attention architecture that combines Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) to improve long-context efficiency; (2) Manifold-Constrained Hyper-Connections (mHC) that enhance conventional residual connections; and (3) the Muon optimizer for faster convergence and greater training stability.
We pre-train both models on more than 32T diverse and high-quality tokens, followed by a comprehensive post-training pipeline that unlocks and further enhances their capabilities. DeepSeek-V4-Pro-Max, the maximum-reasoning-effort mode of DeepSeek-V4-Pro, redefines the state of the art for open models, outperforming its predecessors on core tasks.
Meanwhile, the DeepSeek-V4 series is highly efficient in long-context scenarios. In the one-million-token context setting, DeepSeek-V4-Pro requires only 27% of the single-token inference FLOPs and 10% of the KV cache of DeepSeek-V3.2. This enables us to routinely support one-million-token contexts, making long-horizon tasks and further test-time scaling more feasible.
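The efficiency figures above can be restated as reduction factors. The sketch below is plain arithmetic on the two ratios quoted in the abstract (27% of single-token inference FLOPs and 10% of KV cache, relative to DeepSeek-V3.2 at a one-million-token context). The constant names and the helper `reduction_factor` are illustrative assumptions, not part of any released code.

```python
# Ratios of DeepSeek-V4-Pro relative to DeepSeek-V3.2 at a one-million-token
# context, as quoted in the abstract. Constant names are illustrative.
FLOPS_RATIO = 0.27     # single-token inference FLOPs
KV_CACHE_RATIO = 0.10  # KV cache size


def reduction_factor(ratio: float) -> float:
    """Convert a 'requires X of the baseline' ratio into an 'N times less' factor."""
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in the interval (0, 1]")
    return 1.0 / ratio


if __name__ == "__main__":
    print(f"FLOPs per token: {reduction_factor(FLOPS_RATIO):.1f}x lower")
    print(f"KV cache:        {reduction_factor(KV_CACHE_RATIO):.1f}x smaller")
```

In other words, the quoted ratios correspond to roughly 3.7 times fewer single-token inference FLOPs and a 10 times smaller KV cache than DeepSeek-V3.2 in this setting.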