EntMTP: Accelerating LLM Inference with Entropy Guided Multi Token Prediction
Abstract: Multi-token prediction (MTP) has been shown to increase data density during training and improve downstream text-generation quality, and it serves as the de facto approach for self-speculative decoding.
Existing foundation and open-source models that use MTP heads commit to a static tree-based attention topology for the entire generation sequence, so the speculation depth, and thus the compute required during verification, stays constant regardless of context.
This is fundamentally misaligned with the entropy patterns of natural language, where low-entropy regions often support reliable multi-step drafting while high-entropy regions require more conservative speculation.
To address this, we propose Entropy-guided Multi-Token Prediction (EntMTP), a training-free scheduler that switches among tree-based attention topologies, drawn from a set of task-specific Pareto-optimal trees, conditioned on a running estimate of local generation entropy.
By matching speculation depth to context predictability, EntMTP maximizes expected accepted-token throughput across the full distribution of generated text without sacrificing generation quality.
Evaluated on the HumanEval, ShareGPT, GSM8K, and LitBench benchmarks, EntMTP achieves a consistent 1.15x speedup over the Hydra baseline and a peak speedup of 1.36x over the Medusa baseline.
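The scheduling idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the exponential moving average, the threshold values, the tree names ("deep", "medium", "shallow"), and the class and function names are all assumptions made for illustration. The abstract states only that the tree topology is chosen from a running estimate of local generation entropy.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple


def token_entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


@dataclass
class EntropyScheduler:
    """Hypothetical scheduler: maps a running entropy estimate to a tree name."""
    # (entropy upper bound, tree name), sorted by ascending bound.
    # Lower bounds are paired with deeper speculation trees (assumed mapping).
    thresholds: List[Tuple[float, str]]
    fallback: str                     # tree used when entropy exceeds every bound
    alpha: float = 0.5                # EMA smoothing factor (assumed value)
    running: Optional[float] = None   # running entropy estimate, None before any update

    def update(self, probs: List[float]) -> float:
        """Fold the entropy of the latest distribution into the running estimate."""
        h = token_entropy(probs)
        if self.running is None:
            self.running = h
        else:
            self.running = self.alpha * h + (1.0 - self.alpha) * self.running
        return self.running

    def select_tree(self) -> str:
        """Return the first tree whose entropy bound covers the running estimate."""
        if self.running is not None:
            for bound, tree in self.thresholds:
                if self.running <= bound:
                    return tree
        return self.fallback


# Usage: a peaked distribution favors deep speculation, a flat one does not.
scheduler = EntropyScheduler(
    thresholds=[(0.5, "deep"), (1.5, "medium")],
    fallback="shallow",
)
scheduler.update([0.97, 0.01, 0.01, 0.01])
print(scheduler.select_tree())
scheduler.update([0.25, 0.25, 0.25, 0.25])
print(scheduler.select_tree())
```

In this sketch the conservative tree is also the default before any entropy has been observed. A real system would pick the thresholds per task, for example from the Pareto-optimal set the abstract mentions.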