Quantifying Consistency in LLM Logical Reasoning via Structural Uncertainty
Quantifying Consistency in LLM Logical Reasoning via Structural Uncertainty
通过结构不确定性量化大语言模型逻辑推理的一致性
Abstract: Large language models can arrive at the same answer through reasoning paths that are unstable, contradictory, or difficult to rank consistently — a failure mode especially prevalent in multi-step deductive reasoning. 摘要: 大语言模型可以通过不稳定、自相矛盾或难以进行一致性排序的推理路径得出相同的答案——这种失效模式在多步演绎推理中尤为普遍。
Existing methods assess reliability primarily through output dispersion — measuring how much sampled answers differ — but this discards a complementary signal: whether the model can consistently rank competing reasoning candidates. 现有的评估方法主要通过输出离散度(即测量采样答案之间的差异程度)来评估可靠性,但这忽略了一个互补的信号:模型是否能够一致地对相互竞争的推理候选方案进行排序。
We propose structural uncertainty, a consistency-aware framework derived from the stability of self-preference-induced rankings over sampled reasoning solutions. 我们提出了“结构不确定性”(structural uncertainty),这是一个基于自偏好诱导排序在采样推理方案上的稳定性而得出的、具有一致性感知能力的框架。
Given a query, we generate multiple candidate solutions and ask the model to judge pairwise preferences among its own outputs. 给定一个查询,我们生成多个候选解决方案,并要求模型对其自身的输出进行两两偏好判断。
We aggregate self-preferences into ranking distributions via Bradley-Terry modeling with PageRank, and decompose the signal into two entropy-based components: across-trial ranking instability and within-trial candidate ambiguity. 我们通过结合 PageRank 的 Bradley-Terry 模型将自偏好聚合为排序分布,并将该信号分解为两个基于熵的组件:跨试验排序不稳定性(across-trial ranking instability)和试验内候选方案模糊性(within-trial candidate ambiguity)。
Across five LLMs and eight benchmarks, structural signals provide information complementary to answer dispersion: on logical and mathematical reasoning tasks, the combination improves identification of unreliable instances, while on factual retrieval the structural signal collapses toward uniformity, diagnosing a regime boundary where reasoning-level consistency evaluation is uninformative. 在五个大语言模型和八个基准测试中,结构信号提供了与答案离散度互补的信息:在逻辑和数学推理任务中,两者的结合提高了对不可靠实例的识别能力;而在事实检索任务中,结构信号趋于一致,这诊断出了一个推理级一致性评估失效的边界范围。
The two components relate differently to accuracy: within-trial ambiguity correlates positively with correctness — consistent with settings where multiple plausible solution paths remain competitive — while across-trial instability correlates negatively, signaling unreliable reasoning. 这两个组件与准确率的关系各不相同:试验内模糊性与正确性呈正相关——这与多个合理的解决方案路径同时存在的情况相符;而跨试验不稳定性则与正确性呈负相关,预示着推理过程不可靠。
Structural uncertainty is best understood not as a universal confidence estimator, but as a regime-sensitive evaluator of logical reasoning consistency. 结构不确定性最好不要被理解为一种通用的置信度估计器,而应被视为一种对逻辑推理一致性的、具有范围敏感性的评估工具。