GLM-5.2: Built for Long-Horizon Tasks
We’re introducing GLM-5.2, our latest flagship model for long-horizon tasks. It marks a substantial leap in long-horizon capability over its predecessor, GLM-5.1, and for the first time delivers that capability on a solid 1M-token context.
GLM-5.2’s new capabilities include:
Solid 1M Context: a 1M-token context that stays stable throughout long-horizon work.
Advanced Coding with Flexible Effort: stronger coding capabilities, with multiple thinking effort levels to balance performance and latency.
Improved Architecture: we propose IndexShare, which reuses one indexer across every four sparse attention layers, reducing per-token FLOPs by 2.9× at a 1M context length. We also improve GLM-5.2’s MTP layer for speculative decoding, increasing the acceptance length by up to 20%.
Pure Open: released under the MIT open-source license, with no regional limits on access.
Supporting long-horizon tasks starts with making long context usable for real engineering: the model must maintain quality across long, messy coding-agent trajectories, not just accept more tokens. A 1M context is easy to claim but much harder to keep reliable under real engineering pressure. To this end, we substantially expanded 1M-context training for coding-agent scenarios, covering large-scale implementation, automated research, performance optimization, and complex debugging. The result is a long-context system that is not only wide in scope but solid in execution: a practical substrate for sustained engineering work.
This capability is reflected in GLM-5.2’s performance on three long-horizon coding benchmarks. FrontierSWE measures whether an agent can complete open-ended technical projects that take hours to tens of hours, spanning systems optimization, large-scale code construction, and applied ML research. On this benchmark, GLM-5.2 trails Opus 4.8 by only 1%, while edging out GPT-5.5 by 1% and Opus 4.7 by 11%.
On PostTrainBench, where each agent is given an H100 GPU and evaluated by how much it can improve small models through post-training, GLM-5.2 outperforms both Opus 4.7 and GPT-5.5, ranking second only to Opus 4.8. SWE-Marathon is an ultra-long-horizon software engineering benchmark covering tasks such as building compilers, optimizing kernels, and developing production-grade services. Here GLM-5.2 still has room to grow: it trails Opus 4.8 by 13% while remaining second only to the Opus series.
Across all three benchmarks, GLM-5.2 is the highest-ranked open-source model, showing that its 1M context has translated into practical long-horizon delivery capability. On standard coding benchmarks, GLM-5.2 is also the strongest open-source model, improving on GLM-5.1 on both: 81.0 vs. 63.5 on Terminal-Bench 2.1 (+17.5 points) and 62.1 vs. 58.4 on SWE-bench Pro (+3.7 points). It also closes much of the gap to the closed-source frontier: on Terminal-Bench 2.1 its score of 81.0 lands 4.0 points behind Claude Opus 4.8 (85.0), while staying ahead of Gemini 3.1 Pro.
GLM-5.2 also introduces effort level control, which lets users explicitly trade model capability against task execution speed and computational cost. As shown in the figure, GLM-5.2 delivers substantially stronger agentic coding performance than GLM-5.1 at comparable token budgets, and under similar token consumption its capability sits roughly between Claude Opus 4.7 and Claude Opus 4.8. The Max effort level additionally lets users allocate more computation when a challenging task calls for higher performance, further extending the model’s coding capability. Users can therefore select the reasoning mode best suited to each coding scenario.
Architecture for 1M Context: IndexShare for DSA
To support a 1M context length, GLM-5.2 applies IndexShare to reduce the computational cost of the indexer in DSA, its sparse attention mechanism. Specifically, every 4 transformer layers share one lightweight indexer. The indexer is placed at the first of the 4 layers, and its top-k indices are reused by all 4 layers. This eliminates the indexer dot product and the top-k operation in 3 of every 4 layers. GLM-5.2 is trained with IndexShare starting from the mid-training stage at 128K sequence length, and it outperforms GLM-5.1 on long-context benchmarks with less computation.
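The layer-sharing pattern described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the GLM-5.2 implementation: the single-head attention, the weight names (`wq_idx`, `wk_idx`, `wq`, `wk`, `wv`), and the helpers `make_layers`, `topk_mask`, and `forward` are all hypothetical. The only behavior taken from the text is that the indexer runs in the first layer of each group of 4 and its top-k selection is reused by the other 3.

```python
import numpy as np

def topk_mask(index_scores, k):
    """Boolean (T, T) mask: per query row, the k best causal key positions."""
    T = index_scores.shape[0]
    causal = np.tril(np.ones((T, T), dtype=bool))
    masked = np.where(causal, index_scores, -np.inf)
    idx = np.argsort(-masked, axis=-1)[:, :k]          # top-k per row
    mask = np.zeros((T, T), dtype=bool)
    np.put_along_axis(mask, idx, True, axis=-1)
    return mask & causal                                # drop non-causal picks

def sparse_attention(q, k, v, mask):
    """Softmax attention restricted to the positions selected by `mask`."""
    scores = (q @ k.T) / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def make_layers(n_layers, d, d_idx, group=4, seed=0):
    """Hypothetical toy weights; only the first layer of each group owns an indexer."""
    rng = np.random.default_rng(seed)
    layers = []
    for i in range(n_layers):
        layer = {name: rng.normal(size=(d, d)) / np.sqrt(d) for name in ("wq", "wk", "wv")}
        if i % group == 0:
            layer["wq_idx"] = rng.normal(size=(d, d_idx)) / np.sqrt(d)
            layer["wk_idx"] = rng.normal(size=(d, d_idx)) / np.sqrt(d)
        layers.append(layer)
    return layers

def forward(x, layers, k, group=4):
    """Run all layers, computing the indexer once per group and reusing its mask."""
    indexer_calls = 0
    mask = None
    for i, layer in enumerate(layers):
        if i % group == 0:
            q_idx, k_idx = x @ layer["wq_idx"], x @ layer["wk_idx"]
            mask = topk_mask(q_idx @ k_idx.T, k)        # indexer dot product + top-k
            indexer_calls += 1
        q, kk, v = x @ layer["wq"], x @ layer["wk"], x @ layer["wv"]
        x = x + sparse_attention(q, kk, v, mask)        # shared selection
    return x, indexer_calls
```

With 8 layers and `group=4`, `forward` runs the indexer twice instead of eight times, which is the "3 of every 4 layers" saving in indexer work. The sketch only counts indexer invocations; it makes no attempt to reproduce the 2.9× per-token FLOPs figure.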
MTP with IndexShare and KVShare
We improve the MTP layer of GLM-5.2 for speculative decoding with two objectives: 1) minimize the cost of the MTP layer as a draft model; 2) maximize the acceptance rate of speculative decoding. For the first objective, we also apply IndexShare to the MTP layer. In multi-step MTP, the indexer is placed on the first step, and its top-k indices are reused for all following steps. Unlike in the backbone, however, each MTP step has different input tokens. As the following figure shows, if we reuse the top-k indices of $h_4$ for $h_5$, then $h_5$ can attend only to $h_1$ through $h_4$, not to itself. We will show that this property helps us achieve the second objective by eliminating the training-inference discrepancy in GLM-5.1’s MTP layer.
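To make the index reuse in multi-step MTP concrete, here is a small sketch, assuming a dot-product indexer and 1-based positions. The function names and the projection matrices `w_q` and `w_k` are hypothetical. It shows only the property stated in the text: when later steps reuse the first step's top-k indices, the newly drafted position never appears in the attended set.

```python
import numpy as np

def indexer_topk(query, keys, k):
    """Sorted 1-based positions of the k keys with the highest dot-product scores."""
    scores = keys @ query
    top = np.argsort(-scores)[:k]
    return sorted(int(i) + 1 for i in top)

def mtp_step_positions(target_hidden, w_q, w_k, k, num_steps):
    """Positions attended at each MTP step when only step 1 runs the indexer.

    target_hidden holds h_1..h_n from the target model (assumption: step 1's
    query is built from h_n). Every later step reuses step 1's selection.
    """
    keys = target_hidden @ w_k
    query = target_hidden[-1] @ w_q
    shared = indexer_topk(query, keys, k)
    return {step: shared for step in range(1, num_steps + 1)}
```

For four target hidden states and two MTP steps, the second step's positions are identical to the first step's and all lie in 1 to 4, so position 5 is excluded by construction.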
The figure shows the inference of a two-step MTP layer. In the first step, inference is consistent with training, with all hidden states coming from the target model. In the second step, however, $h_{1:4}$ come from the target model while $h_5$ comes from the MTP layer. The KV cache of $h_5$ is therefore a mixture of $kv_{1:4}$ computed from the target model and $kv_5$ computed from the MTP layer. With IndexShare, by contrast, the KV cache of $h_5$ includes only $kv_{1:4}$, all computed from the hidden states of the target model. For training, we reuse both the KV cache and the top-k indices of the first MTP step. As in GLM-5.1, the parameters are shared across MTP steps. Furthermore, inspired by https://arxiv.org/abs/2606.12370, we introduce rejection sampling for speculative decoding.
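For readers unfamiliar with rejection sampling in speculative decoding, the sketch below shows the standard accept/resample rule from the speculative sampling literature. The text does not say which variant GLM-5.2 uses or how the cited paper modifies it, so treat this as an assumption-laden illustration; the function name and argument layout are hypothetical. Each drafted token is accepted with probability min(1, p/q), where p is the target model's probability and q the draft's. On the first rejection, a replacement is sampled from the normalized residual max(p - q, 0).

```python
import numpy as np

def speculative_accept(draft_tokens, q_probs, p_probs, rng):
    """Verify drafted tokens against the target model (standard rule, assumed here).

    draft_tokens: k token ids proposed by the draft (MTP) layer
    q_probs:      (k, V) draft distributions the tokens were sampled from
    p_probs:      (k + 1, V) target distributions; the last row is used for
                  the bonus token when every draft is accepted
    Returns the list of tokens emitted in this decoding round.
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_probs[i, tok], q_probs[i, tok]
        if rng.random() < min(1.0, p / q):
            out.append(int(tok))                     # accepted: keep the draft token
            continue
        # Rejected: resample from the residual distribution and stop.
        residual = np.maximum(p_probs[i] - q_probs[i], 0.0)
        residual /= residual.sum()
        out.append(int(rng.choice(len(residual), p=residual)))
        return out
    # All drafts accepted: sample one extra token from the target distribution.
    out.append(int(rng.choice(p_probs.shape[1], p=p_probs[-1])))
    return out
```

Under this rule the emitted tokens follow the target model's distribution, and the number of tokens accepted per round is what an "acceptance length" metric counts. A draft layer whose distributions track the target's more closely gets more tokens accepted.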