SENSE: Semantic Embedding Navigation with Soft-gated Evaluation for Retrieval-based Speculative Decoding

Abstract: Speculative Decoding (SD) accelerates Large Language Model (LLM) inference without compromising generation quality: a lightweight draft model proposes candidate tokens, which the target model then verifies in parallel.
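
The draft-then-verify loop described above can be illustrated with a minimal sketch of the greedy variant of speculative decoding. It is not the authors' implementation: `verify_draft`, `target_next`, and `toy_target` are hypothetical names introduced here, and the target model is called once per position for readability, whereas a real system scores all draft positions in a single batched forward pass.

```python
from typing import Callable, List, Sequence


def verify_draft(
    prefix: Sequence[int],
    draft: Sequence[int],
    target_next: Callable[[Sequence[int]], int],
) -> List[int]:
    """Return the tokens emitted by one draft-then-verify step (greedy variant).

    `target_next` is a hypothetical stand-in for the target model: given a
    context, it returns the token the target would pick greedily.
    """
    emitted: List[int] = []
    context = list(prefix)
    for token in draft:
        expected = target_next(context)
        if token != expected:
            # First mismatch: keep the target's own token and stop.
            emitted.append(expected)
            return emitted
        emitted.append(token)
        context.append(token)
    # Every draft token matched, so the target contributes one extra token.
    emitted.append(target_next(context))
    return emitted


def toy_target(context: Sequence[int]) -> int:
    """Toy 'target model' (assumption): always predicts last token + 1, mod 10."""
    return (context[-1] + 1) % 10


# The draft agrees with the target on 4 and 5, then diverges at 9.
print(verify_draft([1, 2, 3], [4, 5, 9, 7], toy_target))  # [4, 5, 6]
```

Because a draft token is kept only when it equals the target's own greedy choice, the emitted sequence matches what the target would have produced on its own. The speedup comes from emitting several tokens per verification step.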

While Retrieval-based Speculative Decoding (RSD) is favored for its plug-and-play versatility, its potential is impeded by rigid lexical dependencies, rendering both retrieval and verification brittle to surface-level variations.

To address this, we propose SENSE (Semantic Embedding Navigation with Soft-gated Evaluation). By anchoring retrieval on the hidden states of the target model, SENSE establishes robust semantic alignment, which empowers the Soft-gated Evaluation module to validate semantic equivalence rather than surface forms.
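
To make the idea of retrieval keyed on hidden states concrete, here is a minimal sketch. It is an illustration under assumptions and not the SENSE implementation: the abstract does not specify the datastore layout, the similarity measure, or any threshold. The `datastore` structure, cosine similarity, `retrieve_draft`, and `min_sim` are all hypothetical choices made here.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_draft(
    query: Sequence[float],
    datastore: Sequence[Tuple[Sequence[float], Sequence[int]]],
    min_sim: float = 0.8,  # hypothetical threshold, not taken from the paper
) -> List[int]:
    """Return the continuation stored under the key most similar to `query`.

    Each datastore entry pairs a stored hidden-state vector with the token
    continuation that followed it (an assumed layout). An empty list means
    no key was similar enough to propose a draft.
    """
    best_sim, best_continuation = -1.0, []
    for key, continuation in datastore:
        sim = cosine(query, key)
        if sim > best_sim:
            best_sim, best_continuation = sim, list(continuation)
    return best_continuation if best_sim >= min_sim else []


# Toy datastore with 2-dimensional "hidden states" (illustrative values only).
datastore = [
    ([1.0, 0.0], [10, 11, 12]),
    ([0.0, 1.0], [20, 21]),
]
print(retrieve_draft([0.9, 0.1], datastore))  # [10, 11, 12]
```

Matching on vectors rather than on exact token n-grams means that two contexts with different surface wording can still retrieve the same continuation, which is the kind of robustness to surface-level variation the abstract describes.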

To ensure rigorous benchmarking, we deconstruct existing methods into atomic primitives within a unified framework, facilitating granular, component-level comparison.

Extensive experiments across diverse domains demonstrate that SENSE outperforms multiple baselines on the LLaMA and Qwen families, attaining a mean acceptance length of up to 4.09 and a speedup of up to 3.26x while preserving generation quality. Our code will be released upon publication.
