Thinking Past the Answer: Evaluating Harmful Overthinking in Large Reasoning Models
Abstract: Large Reasoning Models (LRMs) improve performance by generating explicit intermediate reasoning traces through increased test-time compute, yet the assumption that longer reasoning is consistently beneficial remains under-examined. While recent evidence shows that additional reasoning can lead models to overthink, we ask: “Once a model has reached the correct answer, does further reasoning refine the solution, or deviate from it?”
To study the dynamics after correctness, we introduce a prefix-level trajectory evaluation protocol grounded in reasoning sufficiency, which we define as the minimum reasoning budget required for a model to first generate the correct answer. This allows us to disentangle verbose overthinking, where additional reasoning is redundant but harmless, from harmful overthinking, where continued reasoning destabilizes an already-correct trajectory.
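The distinction above can be illustrated with a minimal sketch. It assumes each reasoning trace has already been evaluated at a sequence of increasing prefix budgets, giving one correctness flag per prefix; how those flags are obtained is not specified here and is treated as given input. All function names and labels are hypothetical, and the rule that any incorrect prefix after the first correct one counts as harmful is one possible reading of "destabilizes an already-correct trajectory", not necessarily the paper's exact criterion.

```python
from typing import List, Optional


def sufficiency_index(prefix_correct: List[bool]) -> Optional[int]:
    """Return the index of the first prefix whose answer is correct, or None."""
    for i, ok in enumerate(prefix_correct):
        if ok:
            return i
    return None


def classify_trajectory(prefix_correct: List[bool]) -> str:
    """Label a trajectory by what happens after the first correct prefix."""
    first = sufficiency_index(prefix_correct)
    if first is None:
        return "never_correct"
    tail = prefix_correct[first + 1:]
    if not tail:
        # The model became correct only at the final prefix.
        return "no_overthinking"
    if all(tail):
        # Extra reasoning was redundant but did not change the outcome.
        return "verbose"
    # Assumption: any later incorrect prefix counts as harmful overthinking.
    return "harmful"


def oracle_stop_gain(trajectories: List[List[bool]]) -> float:
    """Accuracy of stopping at the first correct prefix minus final-answer accuracy.

    Assumes every trajectory has at least one prefix.
    """
    n = len(trajectories)
    standard = sum(t[-1] for t in trajectories) / n
    oracle = sum(any(t) for t in trajectories) / n
    return oracle - standard
```

Under these assumptions, `oracle_stop_gain` corresponds to the gap between stopping at the first correct prefix and letting the model reason to completion. It is an oracle quantity, since it relies on knowing the ground-truth answer.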
Starting from multimodal benchmarks, we find that many instances considered reasoning-intensive require surprisingly little reasoning. Moreover, stopping at the first correct prefix improves accuracy over standard reasoning by up to 21%, revealing that current models are limited not only by their ability to reason, but also by their inability to stop at the right time.
Furthermore, while common efficiency strategies such as early stopping substantially reduce verbose overthinking (by up to 50%), they fail to mitigate harmful overthinking. Failure analysis reveals that deviations from correctness are mainly driven by logical drift and visual reinterpretation. Finally, we show that our findings generalize to language-only reasoning benchmarks, highlighting harmful overthinking as a broader reliability risk.