When Should a Language Model Trust Itself? Same-Model Self-Verification as a Conditional Confidence Signal


Abstract: Same-model self-verification, in which a model is prompted to audit its own predicted answer, is a plausible confidence signal for selective prediction, but its practical value remains unclear once strong likelihood-based baselines are taken seriously.

We evaluate self-verification against two such baselines, LL-AVG and LL-SUM, on ARC-Challenge and TruthfulQA-MC across multiple model families, scales, and prompt variants. We measure not only how well each signal ranks correct answers above incorrect ones, but also abstention quality, through the area under the risk-coverage curve (AURC) and operating-point analyses.
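To make the quantities concrete, the following is a minimal sketch and not the evaluation code used in this work. It assumes that LL-SUM is the summed token log-likelihood of the predicted answer, that LL-AVG is the same sum divided by the number of answer tokens, and that AURC is the mean selective risk over all coverage levels when examples are ranked from most to least confident. The function names `ll_sum`, `ll_avg`, and `aurc` are hypothetical.

```python
from typing import Sequence


def ll_sum(token_logprobs: Sequence[float]) -> float:
    """Assumed LL-SUM: total log-likelihood of the answer tokens."""
    return sum(token_logprobs)


def ll_avg(token_logprobs: Sequence[float]) -> float:
    """Assumed LL-AVG: length-normalized log-likelihood of the answer tokens."""
    return sum(token_logprobs) / len(token_logprobs)


def aurc(confidence: Sequence[float], correct: Sequence[bool]) -> float:
    """Area under the risk-coverage curve (lower is better).

    Examples are answered in order of decreasing confidence. After the
    k most confident examples, the selective risk is the error rate
    among those k. AURC is approximated here as the mean of that risk
    over k = 1..n.
    """
    # Indices sorted from most to least confident (ties keep input order).
    order = sorted(range(len(confidence)), key=lambda i: -confidence[i])
    errors_so_far = 0
    risks = []
    for k, i in enumerate(order, start=1):
        errors_so_far += 0 if correct[i] else 1
        risks.append(errors_so_far / k)
    return sum(risks) / len(risks)


if __name__ == "__main__":
    # Illustrative values only, not results from the paper.
    conf = [0.9, 0.8, 0.3, 0.1]
    labels = [True, True, False, False]
    print(aurc(conf, labels))
```

Under these assumptions, a confidence signal that ranks every correct answer above every incorrect one yields a lower AURC than one that ranks them in reverse, which is the sense in which AURC captures abstention quality rather than accuracy alone.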

The results are sharply task- and model-dependent. On ARC-Challenge, self-verification substantially improves over LL-AVG for Phi-2 and the Qwen models, with the largest gains appearing for Qwen-7B.

On TruthfulQA-MC, however, the signal is less reliable: smaller models can become sensitive to the prompt variant, DeepSeek-R1-Distill-8B degrades relative to LL-AVG, and LL-SUM often remains the stronger practical baseline.

We therefore do not treat self-verification as a general-purpose uncertainty estimator. In this setting, it is better understood as a conditional confidence signal whose value depends on task type, model family, prompt formulation, and, crucially, the baseline it must beat.