Thinking Past the Answer: Evaluating Harmful Overthinking in Large Reasoning Models

Abstract: Large Reasoning Models (LRMs) improve performance by generating explicit intermediate reasoning traces through increased test-time compute, yet the assumption that longer reasoning is consistently beneficial remains under-examined. While recent evidence shows that additional reasoning can lead models to overthink, we ask: “Once a model has reached the correct answer, does further reasoning refine the solution, or deviate from it?”

To study the dynamics after correctness, we introduce a prefix-level trajectory evaluation protocol grounded in reasoning sufficiency, defined as the minimum reasoning budget a model needs to first generate the correct answer. This allows us to disentangle verbose overthinking, where additional reasoning is redundant but harmless, from harmful overthinking, where continued reasoning destabilizes an already-correct trajectory.
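
A minimal sketch of what a prefix-level evaluation could look like, assuming the reasoning trace is already split into steps and that an answer can be elicited after any prefix. The names `evaluate_prefixes`, `answer_from_prefix`, and `PrefixReport`, the exact-match check, and the labeling rule (a trajectory counts as harmful when it was correct at some prefix but its final answer is wrong) are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class PrefixReport:
    sufficiency: Optional[int]  # fewest reasoning steps at which the answer is first correct
    final_correct: bool         # correctness after the full reasoning trace
    label: str                  # hypothetical category name, see evaluate_prefixes


def evaluate_prefixes(
    steps: Sequence[str],
    answer_from_prefix: Callable[[Sequence[str]], str],
    gold: str,
) -> PrefixReport:
    """Check correctness after every prefix of the trace (0..len(steps) steps)."""
    # answer_from_prefix is a hypothetical hook: in practice it would prompt
    # the model to commit to an answer given only the truncated reasoning.
    correct = [
        answer_from_prefix(steps[:k]).strip() == gold.strip()
        for k in range(len(steps) + 1)
    ]
    sufficiency = next((k for k, ok in enumerate(correct) if ok), None)
    final_correct = correct[-1]

    if sufficiency is None:
        label = "never_correct"
    elif not final_correct:
        label = "harmful_overthinking"   # was correct, then drifted away
    elif sufficiency < len(steps):
        label = "verbose_overthinking"   # extra steps were redundant but harmless
    else:
        label = "minimal"                # correct only once the full trace was used
    return PrefixReport(sufficiency, final_correct, label)


# Toy example with made-up answers after 0, 1, 2 and 3 steps.
toy_answers = ["7", "12", "12", "15"]
steps = ["step 1", "step 2", "step 3"]
report = evaluate_prefixes(steps, lambda prefix: toy_answers[len(prefix)], gold="12")
print(report)
```

In this toy trace the answer is first correct after one step and wrong at the end, so the sketch labels it as harmful overthinking. A trajectory that leaves the correct answer and later returns to it would be labeled verbose here, which is a simplification.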

Starting from multimodal benchmarks, we find that many instances considered reasoning-intensive require surprisingly little reasoning. Moreover, stopping at the first correct prefix improves accuracy over standard reasoning by up to 21%, revealing that current models are limited not only by their ability to reason, but also by their inability to stop at the right time.
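
The comparison between standard reasoning and stopping at the first correct prefix can be illustrated with a small calculation. This sketch assumes each instance is summarized as a list of per-prefix correctness flags; the function names and the toy data are hypothetical and do not reproduce any reported number.

```python
from typing import Sequence


def standard_accuracy(trajectories: Sequence[Sequence[bool]]) -> float:
    """Accuracy when the model reasons to the end: only the last prefix counts."""
    return sum(t[-1] for t in trajectories) / len(trajectories)


def first_correct_accuracy(trajectories: Sequence[Sequence[bool]]) -> float:
    """Accuracy if reasoning were stopped at the first correct prefix (an oracle)."""
    return sum(any(t) for t in trajectories) / len(trajectories)


# Made-up per-prefix correctness flags for four instances.
toy = [
    [False, True, True],    # correct early and stays correct
    [False, True, False],   # correct, then deviates
    [False, False, False],  # never correct
    [True, True, True],     # correct without any reasoning
]
gain = first_correct_accuracy(toy) - standard_accuracy(toy)
print(standard_accuracy(toy), first_correct_accuracy(toy), gain)
```

The gap between the two numbers comes entirely from trajectories that were correct at some prefix but ended wrong. Because stopping at the first correct prefix requires knowing the gold answer, it acts as an upper bound rather than a deployable stopping rule.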

Furthermore, while common efficiency strategies like early stopping substantially reduce verbose overthinking (by up to 50%), they fail to mitigate harmful overthinking. Failure analysis reveals that correctness deviations are mainly driven by logical drift and visual reinterpretation. Finally, we show that our findings generalize to language-only reasoning benchmarks, highlighting harmful overthinking as a broader reliability risk.
