Disagreement among frontier LLMs on real-world fact-checks
1. How often the frontier disagrees
On 67% of claims (672 / 1,000; 95% CI: 64–70%), the frontier panel is not unanimous: at least one model dissents from the majority verdict, or no strict majority forms at all.
The breakdown works as follows. For each claim we looked at the five frontier verdicts and asked: did at least three pick the same answer (a strict majority)? If yes, how many of the remaining models dissented? If no strict majority emerged (the verdicts split 2-2-1 or 2-1-1-1 across three or four buckets), the claim falls in the “Models split, no majority” row.
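The row assignment described above can be sketched in a few lines of Python. This is a minimal illustration, not the study's actual code: the function names and every row label other than “Models split, no majority” are assumptions made for the example.

```python
from collections import Counter


def split_pattern(verdicts):
    """Describe a panel's split as bucket sizes, e.g. '5-0', '3-1-1', '2-2-1'."""
    counts = sorted(Counter(verdicts).values(), reverse=True)
    if len(counts) == 1:
        return f"{counts[0]}-0"
    return "-".join(str(c) for c in counts)


def table_row(verdicts):
    """Assign one claim to a row of the breakdown table.

    Only the 'Models split, no majority' label comes from the text;
    the other labels are hypothetical.
    """
    top = max(Counter(verdicts).values())
    # Strict majority: more than half of the panel (3 or more of 5).
    if top <= len(verdicts) / 2:
        return "Models split, no majority"
    dissenters = len(verdicts) - top
    if dissenters == 0:
        return "Unanimous"
    return f"Majority with {dissenters} dissenter(s)"


panel = ["True", "True", "False", "False", "Misleading"]
print(split_pattern(panel), "|", table_row(panel))
```

Counting bucket sizes rather than comparing verdicts pairwise keeps the logic independent of which labels the rubric uses.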
Most of these claims are unlikely to appear in any training corpus with a gold label attached: there is no canonical answer key to pattern-match against and no benchmark leaderboard to anchor to. Throughout, we refer to the “majority” and to “dissent from the majority.” A majority of frontier models is not ground truth. The majority verdict is sometimes wrong, and an individual dissenting model is sometimes right. We use the majority as a structural reference point for measuring disagreement, not as a stand-in for correctness.
(Table data omitted for brevity)
Panel agreement: Krippendorff’s α (ordinal) = 0.639 (n = 1,000 claims, 5 raters). This indicates nontrivial but limited agreement: the models’ verdicts are structured rather than random, but not consistent enough to treat the panel as a single interchangeable judge. Ordinal α is the standard Krippendorff variant for an ordered categorical scale (True / Mostly True / Misleading / False).
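For readers who want to see what goes into the statistic, here is a minimal pure-Python sketch of ordinal Krippendorff’s α using the textbook coincidence-matrix formulation. The text does not say which implementation produced 0.639, so the function below is an illustration; it assumes verdicts are encoded as integer ranks (for example True = 0 through False = 3).

```python
from itertools import permutations


def krippendorff_alpha_ordinal(units, n_levels):
    """Ordinal Krippendorff's alpha.

    units: one list per claim, holding integer ranks in range(n_levels).
    """
    # Coincidence matrix: every ordered pair of verdicts inside a unit
    # contributes 1 / (m - 1), where m is the number of verdicts in that unit.
    o = [[0.0] * n_levels for _ in range(n_levels)]
    for values in units:
        m = len(values)
        if m < 2:
            continue  # a single verdict cannot be paired with anything
        for a, b in permutations(values, 2):
            o[a][b] += 1.0 / (m - 1)

    n_c = [sum(row) for row in o]  # marginal total per level
    n = sum(n_c)                   # total number of pairable verdicts

    def delta2(c, k):
        # Ordinal distance: squared count of verdicts lying between the two
        # levels, with the two end levels counted at half weight.
        lo, hi = min(c, k), max(c, k)
        return (sum(n_c[lo:hi + 1]) - (n_c[lo] + n_c[hi]) / 2.0) ** 2

    levels = range(n_levels)
    observed = sum(o[c][k] * delta2(c, k) for c in levels for k in levels)
    expected = sum(n_c[c] * n_c[k] * delta2(c, k) for c in levels for k in levels)
    if expected == 0:
        raise ValueError("alpha is undefined when only one level is ever used")
    return 1.0 - (n - 1) * observed / expected


# Three toy claims rated by five models (ranks 0..3).
toy = [[0, 0, 0, 0, 1], [3, 3, 3, 2, 3], [1, 1, 2, 1, 1]]
print(round(krippendorff_alpha_ordinal(toy, 4), 3))
```

Under this definition, α is 1 when every panel is unanimous and falls toward or below 0 as within-claim disagreement approaches what chance pairing of the same verdicts would produce.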
Lower bound on model error: For each claim, exactly one of the four verdict buckets is the correct answer. Suppose the panel’s most popular bucket is the correct one, which is the most charitable assumption. The number of models that picked a wrong verdict is then five minus the size of that bucket, which gives these lower bounds:
- ≥1 model wrong on 67% of claims (any non-unanimous panel)
- ≥2 models wrong on 45% of claims (3-2, 3-1-1, or no-majority splits)
- ≥3 models wrong on 13% of claims (no bucket reaches a majority, so at most 2 can be right)
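The bound follows from simple counting: if the largest bucket is assumed correct, every model outside it is wrong. A minimal sketch, with hypothetical function names and toy panels rather than the study’s data:

```python
from collections import Counter


def min_wrong(verdicts):
    """Fewest models that can be wrong: panel size minus the largest bucket."""
    return len(verdicts) - max(Counter(verdicts).values())


def share_with_at_least(panels, k):
    """Fraction of claims on which at least k models must be wrong."""
    return sum(min_wrong(p) >= k for p in panels) / len(panels)


# Toy panels: unanimous, 4-1, 3-2, and 2-2-1.
panels = [
    ["True"] * 5,
    ["True", "True", "True", "True", "False"],
    ["True", "True", "True", "False", "False"],
    ["True", "True", "False", "False", "Misleading"],
]
for k in (1, 2, 3):
    print(k, share_with_at_least(panels, k))
```

If the most popular bucket is in fact wrong on a claim, the true count of wrong models is higher than `min_wrong` reports, which is why these figures are lower bounds.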
2. Substantive vs nuance disagreement
On 34% of claims (343 / 1,000; 95% CI: 31–37%), at least two frontier models pick verdicts that are two or more buckets apart in our four-bucket rubric, a disagreement that goes beyond calibration. Not all disagreements are equal: a “True” vs “Mostly True” split is a confidence-calibration shift, while a “True” vs “False” split is a substantive disagreement about the answer.
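The two-bucket rule reduces to the gap between the most extreme verdicts on a claim. A minimal sketch, assuming the rubric is ordered True, Mostly True, Misleading, False; the rank encoding and the names `RANK`, `max_gap`, and `disagreement_kind` are illustrative:

```python
# Assumed rank encoding for the four-bucket rubric.
RANK = {"True": 0, "Mostly True": 1, "Misleading": 2, "False": 3}


def max_gap(verdicts):
    """Largest distance, in buckets, between any two verdicts on a claim."""
    ranks = [RANK[v] for v in verdicts]
    return max(ranks) - min(ranks)


def disagreement_kind(verdicts):
    """Label a panel as unanimous, nuance (adjacent buckets), or substantive."""
    gap = max_gap(verdicts)
    if gap == 0:
        return "unanimous"
    return "substantive" if gap >= 2 else "nuance"


print(disagreement_kind(["True", "Mostly True", "True", "True", "True"]))
print(disagreement_kind(["True", "True", "True", "True", "False"]))
```

Some pair of models sits two or more buckets apart exactly when the maximum rank minus the minimum rank is at least 2, so no pairwise loop is needed.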
3. Model-vs-model agreement
Highest peer agreement: Gemini 3 Pro × Gemini 3 Pro + Search (75%), which is unsurprising since they share a base model. Lowest: Claude Opus 4.7 × Gemini 3 Pro, Claude Opus 4.7 × Gemini 3 Pro + Search, and Gemini 3 Pro × Sonar Pro (53% each); these three pairs tie at the floor.
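A pairwise agreement table of this kind can be sketched as below. The text does not define the metric, so this example assumes that “agreement” means the two models chose exactly the same bucket on a claim; the model names and data are placeholders.

```python
from itertools import combinations


def pairwise_agreement(verdicts_by_model):
    """Share of claims on which each pair of models picked the same bucket.

    verdicts_by_model: dict mapping model name -> list of verdicts,
    with all lists aligned so that index i refers to the same claim.
    """
    result = {}
    for a, b in combinations(sorted(verdicts_by_model), 2):
        va, vb = verdicts_by_model[a], verdicts_by_model[b]
        matches = sum(x == y for x, y in zip(va, vb))
        result[(a, b)] = matches / len(va)
    return result


# Placeholder models and verdicts, not the study's data.
toy = {
    "model_a": ["True", "False", "Misleading", "True"],
    "model_b": ["True", "False", "False", "Mostly True"],
    "model_c": ["False", "False", "Misleading", "True"],
}
for pair, share in pairwise_agreement(toy).items():
    print(pair, share)
```

Exact-match agreement treats a “True” vs “Mostly True” split the same as a “True” vs “False” split, so it is stricter than the ordinal α reported earlier.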
4. Per-model behavior
Two angles on the same five models: how each one distributes its verdicts (4.1), and how often each one’s verdict matches the strict majority of the other four (4.2).
4.1 Verdict distribution: Some models concentrate verdicts at the True/False poles; others distribute more broadly across the middle two buckets. This reflects model-level decision priors interacting with the specific claims. Without ground truth, we can’t separate a model’s priors from the properties of the claims themselves.
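A per-model distribution of this kind is a normalized count over the four buckets. The sketch below also computes a “polar share” (True plus False), a summary introduced here only for illustration; it is not a metric named in the text.

```python
from collections import Counter

SCALE = ["True", "Mostly True", "Misleading", "False"]


def verdict_shares(verdicts):
    """Fraction of one model's verdicts falling in each bucket."""
    counts = Counter(verdicts)
    return {label: counts[label] / len(verdicts) for label in SCALE}


def polar_share(verdicts):
    """Fraction of verdicts at the two poles (hypothetical summary statistic)."""
    shares = verdict_shares(verdicts)
    return shares["True"] + shares["False"]


toy = ["True", "True", "False", "Misleading"]
print(verdict_shares(toy))
print(polar_share(toy))
```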
4.2 Agreement with the rest of the panel: Across the five models, peer-majority agreement ranges from 69% to 81%. This is peer alignment in this corpus, not correctness; no model is treated as ground truth here.
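One way to compute peer-majority agreement is sketched below. The text does not say how claims are handled when the other four models form no strict majority, so skipping those claims is an assumption of this example; the function names and toy data are also illustrative.

```python
from collections import Counter


def peer_majority(others):
    """Verdict held by a strict majority of the other models, else None."""
    label, count = Counter(others).most_common(1)[0]
    # With four peers, a strict majority means 3 or 4 of them.
    return label if count > len(others) / 2 else None


def peer_majority_agreement(verdicts_by_model, model):
    """Share of claims on which `model` matches its peers' strict majority."""
    own = verdicts_by_model[model]
    peers = [v for name, v in verdicts_by_model.items() if name != model]
    matched = scored = 0
    for i, verdict in enumerate(own):
        majority = peer_majority([p[i] for p in peers])
        if majority is None:
            continue  # ASSUMPTION: claims with no peer majority are skipped
        scored += 1
        matched += verdict == majority
    return matched / scored if scored else float("nan")


# Placeholder panel of five models over three claims.
toy = {
    "m1": ["True", "False", "True"],
    "m2": ["True", "False", "Misleading"],
    "m3": ["True", "False", "Misleading"],
    "m4": ["True", "True", "False"],
    "m5": ["False", "False", "False"],
}
for name in toy:
    print(name, peer_majority_agreement(toy, name))
```

Excluding the model under evaluation from the majority keeps it from voting on its own reference point. Counting no-majority claims as misses instead of skipping them would lower every model’s score.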