The Coin Flip Judge? Reliability and Bias in LLM-as-a-Judge Evaluation
Abstract: LLM-as-a-Judge is now widely used to rank model outputs, train reward models, and populate public leaderboards, but its run-to-run reliability remains under-characterized. We study repeated identical evaluations on 29 tasks spanning 10 categories using two OpenAI judge models (GPT-4o-mini and GPT-4.1-mini), with 50 pairwise trials and 50 pointwise trials per question, supplemented by temperature and prompt-sensitivity ablations.
Across judges, pairwise preferences flip on average 13.6% of the time, with 28% of questions exceeding a 20% flip rate and one question reaching 56%. GPT-4o-mini also exhibits a significant first-position bias (72% A-majority, p = 0.024). At the same time, mean pointwise score gaps are small (0.19–0.36 on a 10-point scale) and not statistically significant in aggregate, producing a pairwise-pointwise gap: judges frequently choose a winner even when their own scalar scores provide little evidence of a meaningful quality difference.
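To make the two headline statistics concrete, here is a minimal sketch of how a per-question flip rate and a first-position bias test could be computed. It rests on assumptions that the abstract does not confirm: "flip rate" is taken to mean the share of trials that disagree with the question's most common verdict, and the bias test is taken to be an exact two-sided binomial test against a fair coin. The function names and the example counts (21 of 29 questions with an A majority, chosen only because 21/29 is about 72%) are hypothetical.

```python
from collections import Counter
from math import comb


def flip_rate(verdicts):
    """Share of trials that disagree with the most common verdict.

    Assumed definition; the paper may define flips differently.
    """
    counts = Counter(verdicts)
    majority_count = counts.most_common(1)[0][1]
    return (len(verdicts) - majority_count) / len(verdicts)


def binom_two_sided_p(k, n):
    """Exact two-sided binomial p-value against a fair coin (p0 = 0.5).

    With p0 = 0.5 the distribution is symmetric, so the two-sided
    p-value is twice the upper tail, capped at 1.
    """
    extreme = max(k, n - k)
    tail = sum(comb(n, i) for i in range(extreme, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical question: 50 pairwise trials, 41 prefer A and 9 prefer B.
trials = ["A"] * 41 + ["B"] * 9
print(flip_rate(trials))  # 0.18

# Hypothetical counts: A wins the majority on 21 of 29 questions.
print(round(binom_two_sided_p(21, 29), 4))  # 0.0241
```

Under these assumed counts the test gives a p-value close to the reported 0.024, but that agreement does not establish which test or counts the authors actually used.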
Beyond within-judge instability, cross-judge agreement is only 76% ($\kappa = 0.51$), semantically equivalent prompt templates change majority outcomes in 25% of tested cases, and deterministic decoding reduces but does not eliminate inconsistency. A reliability curve analysis shows that, in our dataset, 11 repeated trials are needed for a majority vote to recover the 50-trial reference verdict with 95% probability on average, rising to 15 for high-variance questions.
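The reliability-curve idea can be sketched analytically for a single question. This is not the authors' procedure (they may have resampled the observed trials instead); it assumes trials are independent, verdicts are binary, and each trial agrees with the reference verdict with a fixed probability `p_agree`. The function names and the 0.95 target default are illustrative.

```python
from math import comb


def majority_recovery_prob(p_agree, k):
    """Probability that a majority of k independent trials matches the
    reference verdict, when each trial matches with probability p_agree.

    k is assumed odd so that a strict majority always exists.
    """
    need = k // 2 + 1
    return sum(
        comb(k, i) * p_agree ** i * (1 - p_agree) ** (k - i)
        for i in range(need, k + 1)
    )


def trials_needed(p_agree, target=0.95, max_k=101):
    """Smallest odd k whose majority vote reaches the target probability.

    Returns None if no odd k up to max_k reaches it (for example when
    p_agree <= 0.5, where more votes do not help).
    """
    for k in range(1, max_k + 1, 2):
        if majority_recovery_prob(p_agree, k) >= target:
            return k
    return None


# A question whose single trials match the reference 80% of the time.
print(trials_needed(0.80))  # 7
```

Under this simplified model, a question at the average 13.6% flip rate would need only 5 trials. The larger figures reported above come from averaging over questions with very different variances, so this sketch illustrates the shape of the curve and does not reproduce the reported numbers.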
These findings suggest that single-trial LLM judging is often too noisy for high-stakes evaluation, and that multi-trial aggregation, position randomization, and explicit uncertainty reporting should be standard practice. Because both judges are from a single provider, cross-provider replication remains an important next step.