When Does Learning to Stop Help? A Cost-Aware Study of Early Exits in Reasoning Models

Abstract: Reasoning models spend different amounts of useful computation across instances, but it remains unclear when a learned stopping rule improves over simple confidence or convergence thresholds. We study this question with LearnStop, a hidden-state-free checkpoint stopper for reasoning language models. At fixed budget checkpoints, LearnStop probes a short answer from the current reasoning prefix and predicts prefix correctness from online features such as answer confidence, entropy, prefix vote share, answer stability, and backtracking-marker density.
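
As a rough illustration of the checkpoint-stopping idea described above, the following Python sketch computes the five named online features from a history of probed answers and applies a logistic scorer. Everything beyond the feature names is an assumption made for illustration: the `Checkpoint` container, the marker list, the exact feature definitions, the logistic form, and the 0.9 threshold are hypothetical and are not taken from the paper.

```python
import math
from dataclasses import dataclass

# Hypothetical marker list; the source does not enumerate its markers.
BACKTRACK_MARKERS = ("wait", "hmm", "actually", "alternatively")


@dataclass
class Checkpoint:
    prefix_text: str   # reasoning prefix up to this budget checkpoint
    answer: str        # short answer probed from the prefix
    confidence: float  # probability assigned to the probed answer
    entropy: float     # entropy of the probed answer distribution


def features(history):
    """Online features for the latest checkpoint in `history` (assumed definitions)."""
    cur = history[-1]
    answers = [c.answer for c in history]
    # Prefix vote share: fraction of checkpoints so far that agree with the current answer.
    vote_share = answers.count(cur.answer) / len(answers)
    # Answer stability: length of the trailing run of identical answers, normalized.
    run = 0
    for a in reversed(answers):
        if a != cur.answer:
            break
        run += 1
    # Backtracking-marker density: marker words per whitespace-separated word.
    words = cur.prefix_text.lower().split()
    markers = sum(w.strip(".,!?") in BACKTRACK_MARKERS for w in words)
    density = markers / max(len(words), 1)
    return [cur.confidence, cur.entropy, vote_share, run / len(answers), density]


def p_correct(x, weights, bias):
    """Logistic estimate that the current prefix answer is correct."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def should_stop(history, weights, bias, threshold=0.9):
    """Stop at this checkpoint if predicted correctness clears the threshold."""
    return p_correct(features(history), weights, bias) >= threshold
```

In such a setup, `weights` and `bias` would be fit on held-out trajectories labeled by prefix correctness, and `threshold` would be chosen on validation data to trade accuracy against token cost.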

Across 18 task-model settings spanning five benchmarks (GSM8K, MATH-500, MMLU-Pro, AIME-90, and GPQA) and both Qwen3 and DeepSeek-R1-distilled models, the answer is task-dependent. On free-form math, learned multi-feature stopping improves the fixed-budget frontier and often beats scalar exits: on GSM8K with Qwen3-32B, the empirical frontier reaches a post-hoc peak adapt gain of +0.157, validation-selected operating points preserve positive gains, and the paired gain over the strongest scalar baseline is +0.028.

On multiple-choice and very hard settings, scalar confidence, entropy, or stability rules are competitive or stronger. We therefore frame learned stopping not as a universal replacement for scalar exits, but as a tool whose value depends on trajectory structure. We further provide validation-selected operating points, paired bootstrap tests, finite-grid lost-correct risk calibration, cost accounting under KV-fork, prefix-cache, and black-box regimes, H100 serving profiles, checkpoint-schedule sweeps, transfer analyses, and robustness checks.
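
For readers unfamiliar with paired bootstrap tests, the sketch below shows one common way to estimate a paired gain between two stopping rules evaluated on the same questions. It is a generic illustration, not the paper's procedure: resampling over questions, the percentile interval, the 2000 resamples, and the function name `paired_bootstrap` are all assumptions.

```python
import random


def paired_bootstrap(a, b, n_resamples=2000, seed=0):
    """Mean paired difference a - b with a 95% percentile interval.

    `a` and `b` hold per-question scores (e.g. 0/1 correctness) for two
    methods on the same questions, in the same order.
    """
    assert len(a) == len(b) and a
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    observed = sum(diffs) / n
    means = []
    for _ in range(n_resamples):
        # Resample questions with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int(0.025 * n_resamples)]
    hi = means[int(0.975 * n_resamples) - 1]
    return observed, lo, hi
```

Because each resample draws whole questions, per-question correlation between the two methods is preserved, which is what distinguishes a paired test from comparing two independent accuracy estimates.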

The main practical finding is that learned stopping is useful when many questions become correct before full budget but do not exhibit a single reliable scalar stopping signal; its benefits largely disappear when confidence or answer convergence already solves the stopping problem.
