EntMTP: Accelerating LLM Inference with Entropy-Guided Multi-Token Prediction

Abstract: Multi-token prediction (MTP) has been shown to increase data density during training and improve downstream text-generation quality, and it serves as the de facto approach for self-speculative decoding.

Existing foundation and open-source models that use MTP heads commit to a static tree-based attention topology for the entire generation sequence, so the speculation depth, and thus the compute required during verification, stays constant regardless of context.

This is fundamentally misaligned with the entropy patterns of natural language: low-entropy regions often support reliable multi-step drafting, while high-entropy regions require more conservative speculation.

To address this, we propose Entropy-guided Multi-Token Prediction (EntMTP), a training-free scheduler that switches among tree-based attention topologies, drawn from a set of task-specific Pareto-optimal trees, conditioned on a running estimate of local generation entropy.
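As a rough illustration of the scheduling idea, the sketch below picks a speculation tree from a smoothed entropy estimate. It is not the authors' implementation: the abstract does not say how the running estimate is computed or how trees are indexed, so the exponential moving average, the `alpha` smoothing factor, the threshold values, and the tree labels are all assumptions made for this example.

```python
import math
from dataclasses import dataclass, field
from typing import Sequence


def shannon_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


@dataclass
class EntropyTreeScheduler:
    """Hypothetical scheduler: maps a running entropy estimate to a tree label."""

    # Assumed tree labels, ordered from deepest to shallowest speculation.
    trees: Sequence[str] = ("deep", "medium", "shallow")
    # Assumed entropy cut points (nats) separating consecutive trees.
    thresholds: Sequence[float] = (0.5, 2.0)
    # Assumed smoothing factor for the exponential moving average.
    alpha: float = 0.5
    running_entropy: float = 0.0
    _initialized: bool = field(default=False, repr=False)

    def update(self, probs: Sequence[float]) -> str:
        """Fold one decoding step's distribution into the estimate, then select."""
        h = shannon_entropy(probs)
        if not self._initialized:
            self.running_entropy = h
            self._initialized = True
        else:
            self.running_entropy = (
                self.alpha * h + (1.0 - self.alpha) * self.running_entropy
            )
        return self.select()

    def select(self) -> str:
        """Return the first tree whose cut point exceeds the running entropy."""
        for tree, cut in zip(self.trees, self.thresholds):
            if self.running_entropy < cut:
                return tree
        return self.trees[-1]


if __name__ == "__main__":
    scheduler = EntropyTreeScheduler()
    # A sharply peaked distribution (predictable context) favors deep drafting.
    print(scheduler.update([0.97, 0.01, 0.01, 0.01]))
    # A flatter distribution raises the estimate and backs off speculation.
    print(scheduler.update([0.25, 0.25, 0.25, 0.25]))
```

In a real system the labels would index precomputed tree attention masks, and the thresholds would presumably be tuned per task; both are left abstract here.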

By matching speculation depth to context predictability, EntMTP maximizes expected accepted-token throughput across the full distribution of generated text without sacrificing generation quality.

Evaluated on the HumanEval, ShareGPT, GSM8K, and LitBench benchmarks, EntMTP consistently achieves a 1.15x speedup over Hydra baselines and a peak speedup of 1.36x over Medusa baselines.