More Thinking, More Bias: Length-Driven Position Bias in Reasoning Models
Abstract: Chain-of-thought (CoT) reasoning and reasoning-tuned models such as DeepSeek-R1 are commonly assumed to reduce shallow heuristic biases by thinking carefully. We test this on position bias in multiple-choice QA and find a different story: within any reasoning-capable model, per-question position bias scales with the length of the reasoning trajectory.
Across thirteen reasoning-mode configurations (two R1-distilled 7-8B models, two base models prompted with CoT, and DeepSeek-R1 at 671B) on MMLU, ARC-Challenge, and GPQA, twelve show a positive partial correlation between trajectory length and Position Bias Score (PBS) after controlling for accuracy, ranging from 0.11 to 0.41 (all p < 0.05). All twelve open-weight reasoning-mode configurations show monotonically increasing PBS across length quartiles.
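The quartile analysis above can be sketched as follows. This is a minimal illustration, not the paper's code: it assumes per-question trajectory lengths and per-question PBS values have already been computed, and the function name is ours.

```python
import numpy as np

def pbs_by_length_quartile(lengths, pbs):
    """Mean PBS within each trajectory-length quartile (Q1..Q4).

    lengths : per-question reasoning-trajectory lengths (e.g. token counts)
    pbs     : per-question Position Bias Scores, same order as `lengths`
    """
    lengths, pbs = np.asarray(lengths, float), np.asarray(pbs, float)
    # Quartile edges over trajectory length; digitize assigns bins 0..3.
    edges = np.quantile(lengths, [0.25, 0.5, 0.75])
    bins = np.digitize(lengths, edges)
    return [pbs[bins == q].mean() for q in range(4)]
```

A monotone increase from Q1 to Q4 in the returned means is the pattern the abstract reports for all twelve open-weight reasoning-mode configurations.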
A truncation intervention provides causal evidence: continuations resumed from later points in the trajectory are increasingly likely to shift toward position-preferred options (16% to 32% for R1-Qwen-7B across absolute-position buckets). At 671B, aggregate PBS collapses to 0.019, but the length effect still manifests in the longest quartile (PBS = 0.071), suggesting that accuracy gates the expression of length-driven bias rather than eliminating the underlying mechanism.
We additionally find that direct-answer position bias is a distinct phenomenon with a different footprint (strong in Llama-Instruct-direct, weak in Qwen-Instruct-direct, and uncorrelated with trajectory length): CoT reasoning replaces this baseline bias with length-accumulated bias. Our results argue that reasoning-capable models should not be treated as order-robust by default in MCQ evaluation pipelines, and offer a diagnostic toolkit (PBS, commitment change point, effective switching, truncation probes) for auditing position bias in reasoning models.
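The length–PBS partial correlation controlling for accuracy can be sketched via residualization. This is an illustrative implementation under our own assumptions (ordinary least-squares residuals plus a Pearson test on the residuals), not necessarily the paper's exact procedure.

```python
import numpy as np
from scipy import stats

def partial_corr(x, y, control):
    """Partial correlation of x and y controlling for `control`.

    Regress x and y on `control` (with intercept), then correlate
    the residuals; returns (r, p) from a Pearson test.
    """
    x, y, control = (np.asarray(v, float) for v in (x, y, control))
    A = np.column_stack([np.ones_like(control), control])
    rx = x - A @ np.linalg.lstsq(A, x, rcond=None)[0]
    ry = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    return stats.pearsonr(rx, ry)
```

With `x` as trajectory length, `y` as per-question PBS, and `control` as accuracy, a positive `r` with `p < 0.05` corresponds to the length effect the abstract reports (partial correlations of 0.11 to 0.41 across configurations).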