LaneRoPE: Positional Encoding for Collaborative Parallel Reasoning and Generation
Abstract: Parallel test-time scaling techniques for LLMs (e.g., best-of-$N$) draw $N>1$ sequences conditioned on the same input prompt. These methods boost accuracy while exploiting the computational efficiency of batching the $N$ generations. However, each sequence in the batch is traditionally generated independently and therefore cannot reuse intermediate generations, computations, or observations from the other sequences.
In this paper, we propose LaneRoPE to enable coordination and collaboration among $N>1$ sequences at generation time. LaneRoPE combines two key ideas: (a) an inter-sequence attention mask that makes the sampling of each sequence depend on the others; and (b) a RoPE extension that injects positional information capturing the relative positions between tokens, both within a sequence and across sequences.
We evaluate our approach on mathematical reasoning tasks and find promising results: LaneRoPE enables collaboration among sequences, yielding additional accuracy gains when the generated sequence length is limited. Importantly, because LaneRoPE requires only minimal changes to the underlying LLM architecture and introduces negligible overhead at inference time, it is an appealing way to rapidly incorporate parallel reasoning into existing LLM inference pipelines.
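To make the two ideas in the abstract concrete, the following is a minimal illustrative sketch, not the paper's implementation. The abstract does not specify the exact mask or rotation, so everything here is an assumption: the function names (`lane_attention_mask`, `lane_rope`), the rule that a token sees its own sequence up to the current step and other sequences only at strictly earlier steps, and the choice to rotate one feature pair by the step index and another by the sequence ("lane") index with arbitrary frequencies.

```python
import math


def lane_attention_mask(num_lanes, steps):
    """Boolean mask over tokens flattened as index = lane * steps + step.

    mask[q][k] is True when query token q may attend to key token k.
    ASSUMED rule (not taken from the paper): a token attends causally
    within its own lane, and to other lanes only at strictly earlier steps.
    """
    size = num_lanes * steps
    mask = [[False] * size for _ in range(size)]
    for q_lane in range(num_lanes):
        for q_step in range(steps):
            q = q_lane * steps + q_step
            for k_lane in range(num_lanes):
                for k_step in range(steps):
                    k = k_lane * steps + k_step
                    if k_lane == q_lane:
                        mask[q][k] = k_step <= q_step  # ordinary causal mask
                    else:
                        mask[q][k] = k_step < q_step   # cross-lane visibility
    return mask


def rotate_pair(x, y, angle):
    """Rotate the 2-D vector (x, y) by `angle` radians, as RoPE does per pair."""
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)


def lane_rope(vec, step, lane, step_freq=0.1, lane_freq=0.7):
    """Hypothetical two-axis rotation of a 4-dimensional vector.

    The first feature pair is rotated by the step index and the second by
    the lane index. Frequencies are arbitrary illustrative values.
    """
    a = rotate_pair(vec[0], vec[1], step * step_freq)
    b = rotate_pair(vec[2], vec[3], lane * lane_freq)
    return [a[0], a[1], b[0], b[1]]


def dot(u, v):
    """Plain dot product, standing in for an attention logit."""
    return sum(a * b for a, b in zip(u, v))


if __name__ == "__main__":
    for row in lane_attention_mask(num_lanes=2, steps=3):
        print("".join("x" if allowed else "." for allowed in row))

    q = [1.0, 2.0, 0.5, -1.0]
    k = [0.3, -0.7, 1.2, 0.4]
    # Same offsets (2 steps, 2 lanes) at different absolute positions.
    print(dot(lane_rope(q, 5, 2), lane_rope(k, 3, 0)))
    print(dot(lane_rope(q, 9, 3), lane_rope(k, 7, 1)))
```

Under these assumptions, the two printed logits agree up to floating-point error, because rotating query and key by position-dependent angles leaves their dot product dependent only on the step offset and the lane offset. That is the sense in which a rotary encoding can capture relative position both within a sequence and across sequences.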