Dustin: Draft-Augmented Sparse Verification for Efficient Long-Context Generation with Speculative Decoding

Abstract: Speculative decoding improves inference throughput for batched, long-context Large Language Model (LLM) serving, but its efficiency is often limited by a verification bottleneck in which Key-Value (KV) cache loading dominates latency. Existing compression methods fail in this regime: static eviction loses accuracy because of saliency shift, while dynamic selection adds prohibitive computational overhead on the verification path.
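
To give a sense of scale for the KV cache loading cost, the following back-of-the-envelope sketch computes the KV cache footprint that one verification pass would have to read. The configuration (80 layers, 8 KV heads, head dimension 128, 16-bit values) and the 2,048-token sparse budget are illustrative assumptions for a 70B-class grouped-query-attention model, not figures reported by the paper.

```python
def kv_cache_bytes(seq_len: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2,
                   batch_size: int = 1) -> int:
    """Bytes of KV cache held for `batch_size` sequences of `seq_len` tokens.

    The leading factor 2 counts one key tensor and one value tensor per layer.
    """
    return (2 * batch_size * seq_len * num_layers
            * num_kv_heads * head_dim * bytes_per_elem)


# Hypothetical model shape (assumption, not taken from the paper).
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

# Full 32k-token context versus an assumed sparse budget of 2,048 tokens.
full_bytes = kv_cache_bytes(32 * 1024, LAYERS, KV_HEADS, HEAD_DIM)
sparse_bytes = kv_cache_bytes(2048, LAYERS, KV_HEADS, HEAD_DIM)

print(f"full cache:   {full_bytes / 2**30:.2f} GiB per sequence")
print(f"sparse cache: {sparse_bytes / 2**20:.2f} MiB per sequence")
print(f"reduction:    {full_bytes // sparse_bytes}x fewer bytes to load")
```

Under these assumptions, every verification step over the full context reads on the order of gigabytes per sequence, and the cost grows linearly with batch size, which is why restricting verification to a small set of critical tokens can pay off.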

We propose Dustin, a sparse verification framework designed for long-context speculative decoding. Dustin combines lookahead signals from the draft model with historical attention from the target model to identify critical tokens with high fidelity across multi-step verification windows. To reduce recomputation latency, Dustin also employs a sparse estimation scheme that restricts importance scoring to a minimal subset of attention heads.
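
The selection idea described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm: the function name `select_critical_tokens`, the linear mixing weight `alpha`, the sum-normalization, and the top-k selection rule are all hypothetical choices made for the example.

```python
from typing import List, Sequence


def _normalize(scores: Sequence[float]) -> List[float]:
    """Scale non-negative scores to sum to 1 (all zeros if the sum is 0)."""
    total = sum(scores)
    if total <= 0:
        return [0.0 for _ in scores]
    return [s / total for s in scores]


def select_critical_tokens(
    draft_scores: Sequence[float],
    target_head_scores: Sequence[Sequence[float]],
    probe_heads: Sequence[int],
    budget: int,
    alpha: float = 0.5,
) -> List[int]:
    """Pick the `budget` context positions with the highest combined score.

    draft_scores:       per-token attention mass from the draft model's
                        lookahead steps (assumed input).
    target_head_scores: [num_heads][num_tokens] historical attention of the
                        target model (assumed input).
    probe_heads:        the small subset of heads used for scoring; all
                        other heads are never read (sparse estimation).
    alpha:              hypothetical mixing weight between the two signals.
    """
    if not probe_heads:
        raise ValueError("probe_heads must contain at least one head index")
    n = len(draft_scores)

    # Historical signal: accumulate attention over the probed heads only.
    history = [0.0] * n
    for h in probe_heads:
        row = target_head_scores[h]
        for i in range(n):
            history[i] += row[i]

    draft = _normalize(draft_scores)
    history = _normalize(history)
    combined = [alpha * d + (1.0 - alpha) * t for d, t in zip(draft, history)]

    # Keep the top-`budget` positions and return them in sequence order.
    k = min(budget, n)
    top = sorted(range(n), key=lambda i: combined[i], reverse=True)[:k]
    return sorted(top)


# Toy example: 5 context tokens, 2 target heads, only one head probed.
draft = [0.1, 0.6, 0.1, 0.1, 0.1]
target = [
    [0.70, 0.10, 0.10, 0.05, 0.05],
    [0.00, 0.00, 0.00, 0.00, 1.00],
]
kept = select_critical_tokens(draft, target, probe_heads=[0], budget=2)
print(kept)
```

In this toy setup the draft signal favors token 1 and the probed target head favors token 0, so both survive a budget of two. Probing a different head changes the historical signal and therefore the selection, which illustrates why the choice of head subset matters for fidelity.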

Evaluations on PG-19 and LongBench with Qwen2.5-72B demonstrate that Dustin achieves a 27.85x speedup in self-attention and a 9.17x end-to-end decoding speedup at a 32k sequence length, all with negligible accuracy degradation.
