More Thinking, More Bias: Length-Driven Position Bias in Reasoning Models
Abstract: Chain-of-thought (CoT) reasoning and reasoning-tuned models such as DeepSeek-R1 are commonly assumed to reduce shallow heuristic biases by thinking carefully. We test this on position bias in multiple-choice QA and find a different story: within any reasoning-capable model, per-question position bias scales with the length of the reasoning trajectory.
Across thirteen reasoning-mode configurations (two R1-distilled 7-8B models, two base models prompted with CoT, and DeepSeek-R1 at 671B) on MMLU, ARC-Challenge, and GPQA, twelve show a positive partial correlation between trajectory length and Position Bias Score (PBS) after controlling for accuracy, ranging from 0.11 to 0.41 (all p < 0.05). All twelve open-weight reasoning-mode configurations show monotonically increasing PBS across length quartiles.
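The quartile analysis above can be sketched as follows. This is a minimal illustration, not the paper's code: it assumes per-question trajectory lengths and per-question PBS values have already been computed, and the function name is ours.

```python
import numpy as np

def pbs_by_length_quartile(lengths, pbs):
    """Mean PBS within each trajectory-length quartile (Q1..Q4).

    lengths : per-question reasoning-trajectory lengths (e.g. token counts)
    pbs     : per-question Position Bias Scores, same order as `lengths`
    """
    lengths, pbs = np.asarray(lengths, float), np.asarray(pbs, float)
    # Quartile edges over trajectory length; digitize assigns bins 0..3.
    edges = np.quantile(lengths, [0.25, 0.5, 0.75])
    bins = np.digitize(lengths, edges)
    return [pbs[bins == q].mean() for q in range(4)]
```

A monotone increase from Q1 to Q4 in the returned means is the pattern the abstract reports for all twelve open-weight reasoning-mode configurations.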
A truncation intervention provides causal evidence: continuations resumed from later points in the trajectory are increasingly likely to shift toward position-preferred options (16% to 32% for R1-Qwen-7B across absolute-position buckets). At 671B, aggregate PBS collapses to 0.019, but the length effect still manifests in the longest quartile (PBS = 0.071), suggesting that accuracy gates the expression of length-driven bias rather than eliminating the underlying mechanism.
We additionally find that direct-answer position bias is a distinct phenomenon with a different footprint (strong in Llama-Instruct-direct, weak in Qwen-Instruct-direct, and uncorrelated with trajectory length): CoT reasoning replaces this baseline bias with length-accumulated bias. Our results argue that reasoning-capable models should not be treated as order-robust by default in MCQ evaluation pipelines, and offer a diagnostic toolkit (PBS, commitment change point, effective switching, truncation probes) for auditing position bias in reasoning models.
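The length–PBS partial correlation controlling for accuracy can be sketched via residualization. This is an illustrative implementation under our own assumptions (ordinary least-squares residuals plus a Pearson test on the residuals), not necessarily the paper's exact procedure.

```python
import numpy as np
from scipy import stats

def partial_corr(x, y, control):
    """Partial correlation of x and y controlling for `control`.

    Regress x and y on `control` (with intercept), then correlate
    the residuals; returns (r, p) from a Pearson test.
    """
    x, y, control = (np.asarray(v, float) for v in (x, y, control))
    A = np.column_stack([np.ones_like(control), control])
    rx = x - A @ np.linalg.lstsq(A, x, rcond=None)[0]
    ry = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    return stats.pearsonr(rx, ry)
```

With `x` as trajectory length, `y` as per-question PBS, and `control` as accuracy, a positive `r` with `p < 0.05` corresponds to the length effect the abstract reports (partial correlations of 0.11 to 0.41 across configurations).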