SpecTr-GBV: Multi-Draft Block Verification Accelerating Speculative Decoding
Abstract: Autoregressive language models suffer from high inference latency due to their sequential decoding nature. Speculative decoding (SD) mitigates this by employing a lightweight draft model to propose candidate tokens, which are selectively verified by a larger target model.
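The standard speculative-decoding verification step the abstract refers to can be sketched as follows. This is a minimal illustration of the classic token-level accept/reject rule (accept a drafted token with probability min(1, p/q)), not the paper's SpecTr-GBV procedure; the function name and interface are hypothetical.

```python
import random

def verify_tokens(draft_tokens, q_probs, p_probs):
    """Token-level speculative verification (standard SD, for illustration).

    draft_tokens: tokens proposed by the draft model.
    q_probs[i]:   draft-model probability of draft_tokens[i].
    p_probs[i]:   target-model probability of draft_tokens[i].
    Returns the accepted prefix of draft_tokens.
    """
    accepted = []
    for tok, q, p in zip(draft_tokens, q_probs, p_probs):
        # Accept the token with probability min(1, p/q); on rejection,
        # stop and (in a full implementation) resample the next token
        # from the residual target distribution, which is omitted here.
        if random.random() < min(1.0, p / q):
            accepted.append(tok)
        else:
            break
    return accepted
```

Because all accepted tokens are verified in one target-model forward pass, each accepted token beyond the first reduces the number of sequential target-model calls, which is the source of the speedup.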
While existing methods adopt either multi-draft strategies to increase acceptance rates or block verification techniques to verify multiple tokens jointly, they treat these improvements in isolation. In this work, we propose SpecTr-GBV, a novel SD method that unifies multi-draft sampling and greedy block verification (GBV) in a single framework.
By formulating the verification step as an optimal transport problem over draft and target token blocks, SpecTr-GBV improves both theoretical efficiency and empirical performance. We prove that SpecTr-GBV achieves the optimal expected acceptance length attainable under i.i.d. draft generation, and that this bound improves as the number of drafts increases.
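The intuition behind the claim that the bound improves with the number of drafts can be illustrated with a naive multi-draft scheme: verify each i.i.d. draft independently and keep the longest accepted prefix. This is only an assumed illustrative heuristic, not the optimal-transport verification SpecTr-GBV proposes, and the greedy matching rule used here is a simplification.

```python
def greedy_accept(draft_tokens, target_argmax):
    """Accept the longest prefix where the drafted token matches the
    target model's argmax (one simple notion of greedy verification;
    the paper's GBV rule may differ)."""
    accepted = []
    for tok, top in zip(draft_tokens, target_argmax):
        if tok != top:
            break
        accepted.append(tok)
    return accepted

def multi_draft_accept(drafts, target_argmax):
    """Among several i.i.d. drafts, keep the longest accepted prefix.

    Since the maximum over i.i.d. attempts stochastically dominates a
    single attempt, adding drafts can only increase the expected
    accepted length, mirroring the monotonicity of the bound.
    """
    return max((greedy_accept(d, target_argmax) for d in drafts), key=len)
```

The optimal-transport formulation in the paper couples the drafts and the target block jointly rather than scoring each draft independently, which is what allows it to reach the optimal expected acceptance length rather than merely improving with draft count.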
Empirically, we evaluate SpecTr-GBV across five datasets against four baselines. Our method achieves superior speedup and significantly higher block efficiency while preserving output quality. We also perform comprehensive ablation studies to assess the impact of key hyperparameters.