When Calibration Rankings Reverse: Accuracy-Controlled Evaluation for Fair Comparison of LLMs

Abstract: Calibration evaluates whether a model's confidence aligns with its empirical accuracy. Existing studies often compare the calibration of different large language models using global calibration metrics such as Expected Calibration Error (ECE) and the Brier score. We begin by showing, both theoretically and empirically, that such comparisons are confounded by differences in model accuracy.
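
To make the two global metrics concrete, the following is a minimal sketch rather than the evaluation code used in this work. The function names, the choice of 10 equal-width bins, and the two toy models are assumptions made for illustration only. Each toy model reports a constant confidence equal to its own accuracy, so both have an ECE of essentially zero, yet their Brier scores differ purely because their accuracies differ.

```python
from typing import List, Sequence, Tuple


def brier_score(conf: Sequence[float], correct: Sequence[int]) -> float:
    """Mean squared gap between confidence and the 0/1 correctness label."""
    return sum((c - y) ** 2 for c, y in zip(conf, correct)) / len(conf)


def expected_calibration_error(
    conf: Sequence[float], correct: Sequence[int], n_bins: int = 10
) -> float:
    """Equal-width-bin ECE: size-weighted mean of |accuracy - mean confidence|."""
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for c, y in zip(conf, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 falls in the last bin
        bins[idx].append((c, y))

    total = len(conf)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


# Hypothetical toy data: model A is 90% accurate, model B is 60% accurate.
# Each model always reports its own accuracy as its confidence.
correct_a = [1] * 9 + [0] * 1
correct_b = [1] * 6 + [0] * 4
conf_a = [0.9] * 10
conf_b = [0.6] * 10

ece_a = expected_calibration_error(conf_a, correct_a)
ece_b = expected_calibration_error(conf_b, correct_b)
brier_a = brier_score(conf_a, correct_a)
brier_b = brier_score(conf_b, correct_b)

print(f"ECE   A={ece_a:.3f}  B={ece_b:.3f}")
print(f"Brier A={brier_a:.3f}  B={brier_b:.3f}")
```

In this toy setting a constant confidence p that matches accuracy p gives a Brier score of p(1 - p), so the more accurate model A scores about 0.09 and model B about 0.24 even though neither is miscalibrated under ECE. This is one simple way a raw global metric can reflect accuracy rather than calibration.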

For fairer cross-model comparison, we then propose ACE, an accuracy-controlled evaluation framework with three complementary views: Instance-Aligned, Distribution-Aligned, and Candidate-Aligned calibration. Across multiple benchmarks, model families, and confidence elicitation methods, we use ACE to study two practically important comparison axes: small versus large models, and thinking versus non-thinking models.

We find that many previously reported calibration advantages under raw global metrics weaken substantially after accuracy control. We also find that ranking reversals are frequent: models favored by raw metrics often cease to be favored once accuracy is controlled. Our results show that raw global calibration metrics are not robust for cross-model comparison, and that fair calibration comparison requires accuracy-aware evaluation.