JetFlow: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting
Abstract: Speculative decoding (SD) accelerates autoregressive Large Language Models (LLMs) by drafting multiple tokens and verifying them in parallel, but it faces a scaling limitation: increasing the draft budget improves speed only when acceptance remains high and drafting overhead stays low. This ceiling has been difficult to break because prior head-based SD methods face a causality-efficiency dilemma.
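The scaling ceiling can be illustrated with a standard simplified cost model for chain drafting. This is a sketch under stated assumptions, not the paper's analysis: each drafted token is accepted independently with probability `alpha`, one draft step costs `draft_cost` target-forward units, and verification costs one target forward regardless of depth. All function names are hypothetical.

```python
def expected_tokens(alpha: float, depth: int) -> float:
    """Expected tokens emitted per verification step for a chain of `depth` drafts.

    Assumes i.i.d. per-token acceptance with probability `alpha`; the verifier
    always contributes one token, so depth 0 yields exactly 1 token.
    """
    if alpha >= 1.0:
        return float(depth + 1)
    return (1.0 - alpha ** (depth + 1)) / (1.0 - alpha)


def sequential_speedup(alpha: float, depth: int, draft_cost: float) -> float:
    """Speedup when the drafter runs once per drafted token (autoregressive)."""
    return expected_tokens(alpha, depth) / (depth * draft_cost + 1.0)


def one_pass_speedup(alpha: float, depth: int, draft_cost: float) -> float:
    """Speedup when the drafter produces all positions in a single pass."""
    return expected_tokens(alpha, depth) / (draft_cost + 1.0)


def best_depth(alpha: float, draft_cost: float, max_depth: int = 64) -> int:
    """Depth in [0, max_depth] that maximizes the sequential-drafting speedup."""
    return max(range(max_depth + 1),
               key=lambda d: sequential_speedup(alpha, d, draft_cost))
```

Under this model, sequential drafting peaks at a finite depth because the numerator saturates at `1 / (1 - alpha)` while the denominator keeps growing. A one-pass drafter removes the depth-dependent cost term, so its speedup is limited only by acceptance, which is why the abstract stresses both conditions together.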
Autoregressive drafters produce path-conditioned candidates, which suit tree-based speculative decoding and yield longer accepted lengths, but their drafting cost grows with tree depth. Bidirectional block-diffusion drafters generate all positions in one pass, but their marginals are branch-agnostic, so the resulting trees can combine tokens that are individually plausible yet mutually inconsistent, which wastes draft budget and reduces acceptance.
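A toy example may make the inconsistency concrete. The sketch below is purely illustrative and is not either drafter from the paper: it assumes a hypothetical target distribution over two-token continuations and compares a tree built from per-position marginals with one built from prefix-conditioned distributions.

```python
from itertools import product

# Hypothetical target distribution over two-token continuations.
JOINT = {("New", "York"): 0.5, ("Los", "Angeles"): 0.5}


def marginals(joint):
    """Per-position token marginals, ignoring which prefix each token follows."""
    depth = len(next(iter(joint)))
    out = [dict() for _ in range(depth)]
    for seq, p in joint.items():
        for i, tok in enumerate(seq):
            out[i][tok] = out[i].get(tok, 0.0) + p
    return out


def marginal_tree(joint, top_k):
    """Branch-agnostic tree: every top-k token at each position joins every branch."""
    per_pos = [sorted(m, key=m.get, reverse=True)[:top_k] for m in marginals(joint)]
    return list(product(*per_pos))


def conditioned_tree(joint, top_k):
    """Path-conditioned tree: each prefix is extended using its own conditional."""
    depth = len(next(iter(joint)))
    prefixes = [()]
    for i in range(depth):
        extended = []
        for pre in prefixes:
            cond = {}
            for seq, p in joint.items():
                if seq[:i] == pre:
                    cond[seq[i]] = cond.get(seq[i], 0.0) + p
            best = sorted(cond, key=cond.get, reverse=True)[:top_k]
            extended.extend(pre + (tok,) for tok in best)
        prefixes = extended
    return prefixes


def wasted_paths(paths, joint):
    """Number of full paths that have zero probability under the target."""
    return sum(1 for path in paths if joint.get(path, 0.0) == 0.0)
```

With `top_k=2`, the marginal tree spends half of its four leaves on "New Angeles" and "Los York", which the target can never accept, while the conditioned tree keeps only the two consistent paths.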
We propose JetFlow, a head-based SD framework that combines one-forward drafting efficiency with branch-wise causal conditioning. JetFlow trains a causal parallel draft head over fused hidden states from the frozen target model, producing candidate trees whose scores align with the target model’s autoregressive factorization. This enables JetFlow to convert larger draft budgets into longer accepted prefixes and higher end-to-end speedup.
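The abstract does not specify how branch-wise causal conditioning is implemented. As a rough sketch, the generic tree-attention construction used in tree-based speculative decoding restricts each node to its own ancestors and scores a branch by summing conditional log-probabilities, mirroring the factorization log p(x_1..x_t) = sum_t log p(x_t | x_<t). The `parents` encoding and both function names are assumptions for illustration, not JetFlow's actual head.

```python
import math


def ancestor_mask(parents):
    """Build a branch-wise causal mask for a candidate tree.

    parents[i] is the index of node i's parent, or -1 for the root.
    mask[i][j] is True iff node i may attend to node j, i.e. j is i itself
    or one of its ancestors; sibling branches stay invisible to each other.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:
            mask[i][j] = True
            j = parents[j]
    return mask


def path_scores(parents, node_logprob):
    """Cumulative log-probability of the branch ending at each node.

    Assumes nodes are listed so that every parent precedes its children.
    """
    scores = [0.0] * len(parents)
    for i, p in enumerate(parents):
        scores[i] = node_logprob[i] + (scores[p] if p != -1 else 0.0)
    return scores


# Example tree: root 0 has children 1 and 2; node 3 extends 1, node 4 extends 2.
PARENTS = [-1, 0, 0, 1, 2]
LOGPROBS = [math.log(p) for p in (1.0, 0.6, 0.4, 0.5, 0.9)]
MASK = ancestor_mask(PARENTS)
SCORES = path_scores(PARENTS, LOGPROBS)
```

In a real draft head, a mask like this would be passed to the attention layers so that all tree nodes are computed in one forward pass while each node still conditions only on its own branch.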
Across math, coding, and chat benchmarks on dense and MoE Qwen3 models, JetFlow consistently outperforms bidirectional-head and tree-based SD baselines. On H100 GPUs, JetFlow achieves up to 9.64x speedup on MATH-500 and 4.58x on open-ended conversational workloads, with further latency gains demonstrated through vLLM integration under realistic serving loads. Our code and models are available at this https URL.