Disagreement among frontier LLMs on real-world fact-checks

1. How often the frontier disagrees

On 67% of claims (672 / 1,000; 95% CI: 64–70%), the frontier panel does not agree: at least one model dissents from the majority verdict, or no strict majority forms at all. The breakdown works as follows. For each claim, we looked at the five frontier verdicts and asked whether at least three models picked the same answer (a strict majority). If so, we counted how many of the remaining models dissented. If no majority emerged, meaning the verdicts split across three or four different buckets, the claim falls in the “Models split, no majority” row.
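The row assignment and the headline interval can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, and the report does not say which interval method it used, so the Wilson score interval here is an assumption that happens to reproduce the reported 64–70% after rounding.

```python
from collections import Counter
from math import sqrt

def classify_split(verdicts):
    """Label one claim's panel, e.g. 'unanimous', '4-1', '3-1-1', 'no majority'."""
    counts = sorted(Counter(verdicts).values(), reverse=True)
    if len(counts) == 1:
        return "unanimous"
    # A strict majority needs more than half the panel (3 of 5).
    if counts[0] * 2 <= len(verdicts):
        return "no majority"
    return "-".join(str(c) for c in counts)

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (assumed method, z=1.96 for 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

if __name__ == "__main__":
    print(classify_split(["True", "True", "True", "False", "Misleading"]))
    low, high = wilson_interval(672, 1000)
    print(f"{low:.3f} to {high:.3f}")
```

Any claim whose label is not "unanimous" counts toward the 67% figure.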

Most of these claims are unlikely to appear in any training corpus with a gold label attached: there is no canonical answer key to pattern-match against and no benchmark leaderboard to anchor to. Below we refer to the “majority” and to “dissent from the majority.” A majority of frontier models is not ground truth. The majority verdict is sometimes wrong; an individual dissenting model is sometimes right. We use the majority as a structural reference point for measuring disagreement, not as a stand-in for correctness.

(Table data omitted for brevity)

Panel agreement: Krippendorff’s α (ordinal) = 0.639 (n = 1,000 claims, 5 raters). This indicates nontrivial but limited agreement: the models’ verdicts are structured rather than random, but not consistent enough to treat the panel as a single interchangeable judge. For reference, Krippendorff’s commonly cited guideline treats α ≥ 0.800 as reliable and 0.667 as the lowest level acceptable for tentative conclusions, so 0.639 sits just under that floor. Ordinal α is the standard Krippendorff variant for an ordered categorical scale (True / Mostly True / Misleading / False).
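For readers who want to see what the ordinal statistic computes, here is a minimal pure-Python sketch of Krippendorff's α with the ordinal distance metric. It is an illustration under stated assumptions, not the code behind the 0.639 figure: the function name is hypothetical, and verdicts are assumed to be encoded as integer ranks 0 to 3 in the order of the scale above.

```python
from itertools import permutations

# Assumed bucket order, taken from the scale named in the text.
SCALE = ["True", "Mostly True", "Misleading", "False"]

def ordinal_alpha(units, n_levels=len(SCALE)):
    """Krippendorff's alpha with the ordinal distance metric.

    units: one list of integer ranks (0 .. n_levels - 1) per claim,
           holding the verdicts of the raters who answered that claim.
    """
    # Coincidence matrix: every ordered pair of verdicts within a claim
    # contributes 1 / (m - 1), where m is the number of verdicts on it.
    coincidence = [[0.0] * n_levels for _ in range(n_levels)]
    for ranks in units:
        m = len(ranks)
        if m < 2:
            continue  # a single verdict cannot be paired
        for a, b in permutations(ranks, 2):
            coincidence[a][b] += 1.0 / (m - 1)

    totals = [sum(row) for row in coincidence]  # pairable uses of each rank
    n = sum(totals)
    if n <= 1:
        raise ValueError("need at least two pairable verdicts")

    def delta_sq(c, k):
        # Ordinal metric: ranks between c and k widen the distance.
        lo, hi = min(c, k), max(c, k)
        return (sum(totals[lo:hi + 1]) - (totals[lo] + totals[hi]) / 2.0) ** 2

    levels = range(n_levels)
    observed = sum(coincidence[c][k] * delta_sq(c, k)
                   for c in levels for k in levels) / n
    expected = sum(totals[c] * totals[k] * delta_sq(c, k)
                   for c in levels for k in levels) / (n * (n - 1))
    if expected == 0:
        raise ValueError("alpha is undefined when every verdict is identical")
    return 1.0 - observed / expected
```

With this encoding, a panel that agrees on every claim scores 1.0, and values near 0 indicate agreement no better than chance given the overall verdict frequencies.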

Lower bound on model error: For each claim, exactly one of the four verdict buckets is the correct answer. If we assume the panel’s most popular bucket is the correct one (the most charitable assumption), every model outside that bucket is wrong, so the minimum number of wrong models on a claim is five minus the size of its largest bucket. Across the corpus this gives:

≥1 model wrong on 67% of claims (any non-unanimous panel)
≥2 models wrong on 45% of claims (3-2, 3-1-1, or no-majority splits)
≥3 models wrong on 13% of claims (no bucket reaches a majority, so at most 2 can be right)
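The lower bound above is simple counting, and a short sketch makes the arithmetic explicit. The function names and the toy panel are hypothetical; the only logic used is the charitable assumption stated in the text, that the most popular bucket is correct.

```python
from collections import Counter

def min_wrong(verdicts):
    """Fewest models that must be wrong if the largest bucket is correct."""
    largest_bucket = max(Counter(verdicts).values())
    return len(verdicts) - largest_bucket

def share_with_at_least(panels, k):
    """Fraction of claims on which at least k models must be wrong."""
    return sum(min_wrong(v) >= k for v in panels) / len(panels)

if __name__ == "__main__":
    # Toy data only, not the study's verdicts.
    panels = [
        ["True"] * 5,                                          # 5-0
        ["True"] * 4 + ["False"],                              # 4-1
        ["True", "True", "True", "False", "False"],            # 3-2
        ["True", "True", "False", "False", "Misleading"],      # no majority
    ]
    for k in (1, 2, 3):
        print(k, share_with_at_least(panels, k))
```

Because a different bucket could be the correct one, the true number of wrong models on a claim can only be equal to or higher than this count.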

2. Substantive vs nuance disagreement

On 34% of claims (343 / 1,000; 95% CI: 31–37%), at least two frontier models pick verdicts that are 2 or more buckets apart in our 4-bucket rubric, a disagreement that goes beyond calibration. Not every disagreement is equal. A “True” vs “Mostly True” split is a confidence-calibration shift. A “True” vs “False” split is a substantive disagreement about the answer.
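The distinction between nuance and substantive disagreement reduces to the widest gap between any two verdicts on a claim. A minimal sketch, assuming the buckets are ranked 0 to 3 in the order True, Mostly True, Misleading, False (the names below are hypothetical):

```python
# Assumed rank order of the 4-bucket rubric.
RANK = {"True": 0, "Mostly True": 1, "Misleading": 2, "False": 3}

def max_gap(verdicts):
    """Largest distance, in buckets, between any two verdicts on a claim."""
    ranks = [RANK[v] for v in verdicts]
    return max(ranks) - min(ranks)

def is_substantive(verdicts, threshold=2):
    """True when at least two models sit `threshold` or more buckets apart."""
    return max_gap(verdicts) >= threshold

if __name__ == "__main__":
    print(is_substantive(["True", "True", "Mostly True", "True", "True"]))
    print(is_substantive(["True", "True", "Misleading", "True", "True"]))
```

Checking the maximum minus the minimum rank is enough, since some pair of models is 2 or more buckets apart exactly when the two extreme verdicts are.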

3. Model-vs-model agreement

Highest peer agreement: Gemini 3 Pro × Gemini 3 Pro + Search (75%), which is unsurprising since they share a base model. Lowest: Claude Opus 4.7 × Gemini 3 Pro, Claude Opus 4.7 × Gemini 3 Pro + Search, and Gemini 3 Pro × Sonar Pro (53% each); these three pairs tie at the floor.
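A pairwise agreement table of this kind can be sketched as below. The report does not spell out how pairwise agreement is scored, so this assumes the simplest reading: the share of claims on which two models pick exactly the same bucket. The function name and the placeholder model names are hypothetical.

```python
from itertools import combinations

def pairwise_agreement(verdicts_by_model):
    """Exact-match rate for every pair of models.

    verdicts_by_model: dict mapping a model name to its list of verdicts,
    with all lists aligned so index i is the same claim for every model.
    """
    rates = {}
    for a, b in combinations(verdicts_by_model, 2):
        va, vb = verdicts_by_model[a], verdicts_by_model[b]
        matches = sum(x == y for x, y in zip(va, vb))
        rates[(a, b)] = matches / len(va)
    return rates

if __name__ == "__main__":
    # Toy data only, not the study's verdicts.
    toy = {
        "model_a": ["True", "False", "Misleading", "True"],
        "model_b": ["True", "False", "False", "Mostly True"],
        "model_c": ["False", "False", "Misleading", "True"],
    }
    for pair, rate in pairwise_agreement(toy).items():
        print(pair, rate)
```

Five models yield ten pairs, so the highest and lowest values quoted above are the extremes of a ten-cell table.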

4. Per-model behavior

Two angles on the same five models: how each one distributes its verdicts (4.1), and how often each one’s verdict matches the strict majority of the other four (4.2).

4.1 Verdict distribution: Some models concentrate verdicts at the True/False poles; others distribute more broadly across the middle two buckets. This reflects model-level decision priors interacting with the specific claims; without ground truth, we cannot separate the two.

4.2 Agreement with the rest of the panel: Across the five models, peer-majority agreement ranges from 69% to 81%. This measures peer alignment in this corpus, not correctness; no model is treated as ground truth here.
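The leave-one-out comparison in 4.2 can be sketched as follows. One detail is an assumption: the report does not say how it treats claims where the other four models form no strict majority (for example a 2-2 split), so this sketch skips them rather than counting them as mismatches. The function name and placeholder model names are hypothetical.

```python
from collections import Counter

def peer_majority_agreement(verdicts_by_model, model):
    """Share of claims where `model` matches the strict majority of its peers.

    Claims on which the peers have no strict majority are skipped
    (an assumption; the source does not specify the denominator).
    """
    peers = [m for m in verdicts_by_model if m != model]
    own = verdicts_by_model[model]
    matched = eligible = 0
    for i, verdict in enumerate(own):
        peer_counts = Counter(verdicts_by_model[m][i] for m in peers)
        top_verdict, top_count = peer_counts.most_common(1)[0]
        if top_count * 2 <= len(peers):
            continue  # no strict majority among the peers (needs 3 of 4)
        eligible += 1
        matched += verdict == top_verdict
    return matched / eligible if eligible else float("nan")

if __name__ == "__main__":
    # Toy data only, not the study's verdicts.
    toy = {
        "model_a": ["True", "True", "False"],
        "model_b": ["True", "True", "False"],
        "model_c": ["True", "False", "False"],
        "model_d": ["True", "False", "True"],
        "model_e": ["False", "False", "True"],
    }
    for name in toy:
        print(name, peer_majority_agreement(toy, name))
```

Counting the skipped claims as mismatches instead would lower every model's rate, so the choice of denominator matters when comparing against the 69% to 81% range.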