When Does a Language Model Commit? A Finite-Answer Theory of Pre-Verbalization Commitment
Abstract: Language models often generate reasoning before giving a final answer, but the visible answer does not reveal when the model’s answer preference became stable. We study this question through a narrow computable object: \emph{finite-answer preference stabilization}. For a model state and specified answer verbalizers, we project the model’s own continuation probabilities onto a finite answer set; in binary tasks this yields an exact log-odds code, $\delta(\xi)=S_\theta(\mathrm{yes}\mid\xi)-S_\theta(\mathrm{no}\mid\xi)$. This target defines parser-based answer onset, retrospective stabilization time, and lead without relying on greedy rollouts or learned probes.
In controlled delayed-verdict tasks with Qwen3-4B-Instruct, the contextual finite-answer projection stabilizes before the answer is parseable, with a mean lead of 17 to 31 tokens in the main templates and a positive but shorter lead in a parser-clean replication. The signal tracks the model’s eventual output rather than the ground truth, is linearly recoverable from compact hidden summaries, is partly separable from cursor progress, and transfers as shared information without a single invariant coordinate. Diagnostics separate the measurement from online stopping, verbalizer-free belief, and causal answer control; exact steering shows local sensitivity of $\delta$ but not reliable control over generation.
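The two central quantities of the abstract can be sketched in a few lines. The code below is a minimal illustration, not the paper's implementation: it assumes $S_\theta(\cdot\mid\xi)$ denotes the log of the probability mass that the next-token distribution assigns to the given answer verbalizers, and that stabilization time is the earliest position after which the sign of $\delta$ matches the final sign for the rest of the trace. The function names and the toy inputs are hypothetical.

```python
import math

def finite_answer_logodds(next_token_logits, yes_ids, no_ids):
    """Project the next-token distribution onto a finite answer set and
    return the binary log-odds code delta = log S(yes) - log S(no).
    Assumes S(.) is the summed probability mass of the verbalizer tokens."""
    z = [math.exp(l) for l in next_token_logits]
    total = sum(z)
    p_yes = sum(z[i] for i in yes_ids) / total
    p_no = sum(z[i] for i in no_ids) / total
    return math.log(p_yes) - math.log(p_no)

def retrospective_stabilization_time(deltas):
    """Earliest position t such that sign(deltas[s]) agrees with the final
    sign for all s >= t; retrospective because it needs the full trace."""
    final_positive = deltas[-1] > 0
    t = len(deltas) - 1
    while t > 0 and (deltas[t - 1] > 0) == final_positive:
        t -= 1
    return t
```

The lead reported in the abstract would then be the gap between the parser-based answer onset and this stabilization time; with single-token verbalizers and logits `[2.0, 0.0]`, `finite_answer_logodds` reduces to the plain logit difference, here 2.0.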