How Language Models Fail: Token-Level Signatures of Committed and Persistent Reasoning Failures
Abstract: Failures in language model reasoning emerge through distinct processes that leave identifiable signatures in the reasoning trace. We characterize these failures using token-level uncertainty signals, finding they arise through two empirically distinguishable processes.
The first is committed failure, in which a model locks onto an incorrect reasoning path early in its trace. Its central diagnostic signature is the commitment point, beyond which considering additional tokens hurts rather than helps failure detection.
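To make the commitment point concrete, the sketch below scores each trace by the mean token-level uncertainty over its first k tokens, measures how well that score separates failing from successful traces (AUROC), and reports the prefix length at which separation peaks. This is an illustrative reading, not the procedure used in the paper: the mean-over-prefix score, the AUROC criterion, the argmax definition of the commitment point, and all function names are assumptions, and the toy traces are synthetic values chosen only to show the shape of the curve.

```python
from typing import List, Sequence


def auroc(scores: Sequence[float], failed: Sequence[int]) -> float:
    """Probability that a failing trace scores higher than a successful one.

    Ties count as 0.5. Assumes a higher score means "more likely to fail".
    """
    pos = [s for s, y in zip(scores, failed) if y == 1]
    neg = [s for s, y in zip(scores, failed) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one failing and one successful trace")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def prefix_score(uncertainties: Sequence[float], k: int) -> float:
    """Mean per-token uncertainty over the first k tokens (hypothetical score)."""
    prefix = uncertainties[:k]
    return sum(prefix) / len(prefix)


def detection_curve(
    traces: Sequence[Sequence[float]], failed: Sequence[int], max_k: int
) -> List[float]:
    """Failure-detection AUROC for each prefix length k = 1..max_k."""
    return [
        auroc([prefix_score(t, k) for t in traces], failed)
        for k in range(1, max_k + 1)
    ]


def commitment_point(curve: Sequence[float]) -> int:
    """Earliest prefix length (1-indexed) at which detection AUROC peaks."""
    return list(curve).index(max(curve)) + 1


# Synthetic per-token uncertainties (arbitrary units), for illustration only.
traces = [
    [3.0, 5.0, 0.0, 0.0, 0.0],  # failing
    [2.0, 5.0, 0.0, 0.0, 0.0],  # failing
    [3.0, 1.0, 4.0, 4.0, 4.0],  # successful
    [1.0, 2.0, 4.0, 4.0, 4.0],  # successful
]
failed = [1, 1, 0, 0]

curve = detection_curve(traces, failed, max_k=5)
print(curve)                    # AUROC per prefix length
print(commitment_point(curve))  # prefix length where detection peaks
```

On these toy traces the curve rises to a peak at the second token and then falls as later tokens are averaged in, which is the qualitative pattern that a commitment point describes. Under persistent uncertainty, one would instead expect the curve to keep improving up to the full trace length.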
The second is persistent uncertainty, in which uncertainty accumulates throughout the trace, so the full trace is needed to best distinguish failing from successful completions. These signatures reproduce across 23 model-dataset configurations, with the framework’s falsifiable predictions holding in 20 of 23 cases, well above chance across both failure modes.
Finally, we demonstrate that our failure-mode framework has direct implications for self-consistency, identifying when uncertainty signals complement it and when it can be selectively skipped. These results offer a foundation for understanding when LLM reasoning failures become detectable and for adapting detection strategies accordingly.