How Much Thinking is Enough? Quantifying and Understanding Redundancy in LLM Reasoning

Abstract: Reasoning-capable large language models solve hard problems by emitting long chains of thought, paying heavily in latency, GPU time, and energy. Casual inspection of their traces reveals extensive reformulation, verification, and circular self-reflection, yet how much of this deliberation is actually necessary has never been measured at scale or explained from first principles.

This paper closes both gaps. We formalise reasoning redundancy directly in terms of the reasoning model $\pi$ itself: the redundancy $\rho$ of a correct trace is the largest fraction of its segmented steps that can be truncated from the end while $\pi$, forced to terminate thinking and emit a final answer, still produces the correct answer.
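
The definition above can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `step_redundancy` and the callback `answers_correctly` (standing in for "force the model to stop thinking after this prefix and check its final answer") are hypothetical, and the sketch assumes that at least one step is always kept and that the shortest correct prefix determines the redundancy.

```python
from typing import Callable, Sequence


def step_redundancy(
    steps: Sequence[str],
    answers_correctly: Callable[[Sequence[str]], bool],
) -> float:
    """Fraction of trailing steps that can be dropped from a correct trace.

    `answers_correctly(prefix)` is a hypothetical oracle: it should return True
    when the model, forced to answer after seeing only `prefix`, is correct.
    """
    n = len(steps)
    if n == 0:
        raise ValueError("trace has no steps")
    # Scan prefixes from shortest to longest; the first correct one is the
    # critical prefix, and everything after it counts as redundant.
    for kept in range(1, n + 1):
        if answers_correctly(steps[:kept]):
            return (n - kept) / n
    raise ValueError("the full trace does not yield a correct answer")


# Toy oracle (assumption for illustration): the answer is recoverable as soon
# as the step containing "x = 4" has been seen.
def toy_oracle(prefix: Sequence[str]) -> bool:
    return any("x = 4" in step for step in prefix)


toy_trace = ["solve: x = 4", "check by substitution", "re-derive", "restate answer"]
toy_rho = step_redundancy(toy_trace, toy_oracle)  # critical prefix is 1 of 4 steps
```

In this toy trace the critical prefix is a single step, so three of the four steps are redundant and `toy_rho` is 0.75. In practice each call to the oracle would cost one forced-answer generation plus a correctness judgement.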

A large-scale quantification across four frontier reasoning models and two mathematical benchmarks shows that step-level redundancy is consistently high: between 61% and 93% across the eight (model, benchmark) conditions we study, with the median critical prefix equal to a single segmented step in six of the eight. The finding is robust to the choice of judge family. Although the redundancy $\rho$ decreases with problem difficulty on MATH-500, all four models remain substantially redundant ($\rho \in [46\%, 85\%]$) even on the hardest Level-5 problems.

We then prove that this redundancy is a structural consequence of length-agnostic outcome rewards, not a model-specific artefact: under any such reward, no finite expected stopping time is optimal. The result holds regardless of RL algorithm, base model, data distribution, or whether the policy is obtained via RL or distillation; over-thinking is therefore not a bug to be patched in individual models but a structural property of how current reasoning models are trained.
