LaneRoPE: Positional Encoding for Collaborative Parallel Reasoning and Generation

Abstract: Parallel LLM test-time scaling techniques (e.g., best-of-$N$) require drawing $N>1$ sequences conditioned on the same input prompt. These methods boost accuracy while exploiting the computational efficiency of batching $N$ generations. However, each sequence in the batch is traditionally generated independently and hence does not reuse intermediate generations, computations, or observations from other sequences.

In this paper, we propose LaneRoPE to enable coordination and collaboration among $N>1$ sequences at generation time. LaneRoPE involves two key ideas: (a) an inter-sequence attention mask that makes the sampling of each sequence dependent on the others; and (b) a RoPE extension that injects positional information capturing the relative positions of tokens, both within a sequence and across sequences.
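To make idea (a) concrete, the following is a minimal sketch of one possible inter-sequence attention mask, not the construction used in the paper. It assumes that the $N$ sequences ("lanes") are decoded in lockstep, that a token may attend to earlier and current tokens in its own lane, and that it may attend only to strictly earlier steps in other lanes, since tokens at the same step are sampled simultaneously. The function names `lane_positions`, `inter_sequence_mask`, and `relative_offset` are hypothetical, and the shared prompt is omitted for brevity.

```python
def lane_positions(num_lanes, steps):
    """Return a (lane, step) pair for every generated token, laid out step-major."""
    return [(lane, t) for t in range(steps) for lane in range(num_lanes)]


def inter_sequence_mask(num_lanes, steps):
    """Build a boolean mask where mask[q][k] is True if query q may attend to key k.

    Assumed rule (illustrative only):
      - same lane: ordinary causal attention (key step <= query step)
      - other lane: only strictly earlier steps (key step < query step)
    """
    pos = lane_positions(num_lanes, steps)
    mask = []
    for q_lane, q_step in pos:
        row = []
        for k_lane, k_step in pos:
            if k_lane == q_lane:
                row.append(k_step <= q_step)
            else:
                row.append(k_step < q_step)
        mask.append(row)
    return mask


def relative_offset(query, key):
    """Hypothetical two-part relative position: (step offset, lane offset).

    A RoPE-style extension could rotate by both components; how LaneRoPE
    actually encodes them is not specified in the abstract.
    """
    (q_lane, q_step), (k_lane, k_step) = query, key
    return (q_step - k_step, q_lane - k_lane)


if __name__ == "__main__":
    for row in inter_sequence_mask(num_lanes=2, steps=2):
        print(["x" if visible else "." for visible in row])
```

Under these assumptions, setting the cross-lane entries to False recovers ordinary independent batched sampling, so the mask is the only thing that couples the sequences.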

We evaluate our approach on mathematical reasoning tasks and find promising results: LaneRoPE enables collaboration among sequences, yielding additional accuracy gains when the generated sequence length is limited. Importantly, because LaneRoPE enables coordination with minimal changes to the underlying LLM architecture and introduces negligible overhead at inference time, it is an appealing way to rapidly incorporate parallel reasoning into existing LLM inference pipelines.
