The Coin Flip Judge? Reliability and Bias in LLM-as-a-Judge Evaluation

Abstract: LLM-as-a-Judge is now widely used to rank model outputs, train reward models, and populate public leaderboards, but its run-to-run reliability remains under-characterized. We study repeated identical evaluations on 29 tasks spanning 10 categories using two OpenAI judge models (GPT-4o-mini and GPT-4.1-mini), with 50 pairwise trials and 50 pointwise trials per question, supplemented by temperature and prompt-sensitivity ablations.

Across judges, pairwise preferences flip on average 13.6% of the time, with 28% of questions exceeding a 20% flip rate and one question reaching 56%. GPT-4o-mini also exhibits a significant first-position bias (72% A-majority, p = 0.024). At the same time, mean pointwise score gaps are small (0.19 to 0.36 on a 10-point scale) and not statistically significant in aggregate, producing a pairwise-pointwise gap: judges frequently choose a winner even when their own scalar scores provide little evidence of a meaningful quality difference.
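
The two pairwise statistics above can be illustrated with a minimal sketch. The abstract does not define the flip rate or name the significance test, so both are assumptions here: the flip rate is taken as the share of trials that disagree with the modal verdict for a question, and the position-bias check is an exact two-sided binomial test against 0.5. The function names, the example verdicts, and the count of 21 A-majority questions out of 29 (chosen only to be consistent with the reported 72%) are hypothetical.

```python
from collections import Counter
from math import comb

def flip_rate(verdicts):
    """Share of trials that disagree with the modal verdict (assumed definition)."""
    modal_count = Counter(verdicts).most_common(1)[0][1]
    return 1 - modal_count / len(verdicts)

def sign_test_p(k, n):
    """Exact two-sided binomial test of k successes in n against p = 0.5."""
    extreme = max(k, n - k)
    # The null distribution is symmetric, so double the upper tail.
    tail = sum(comb(n, i) for i in range(extreme, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical verdicts for one question over 50 pairwise trials.
verdicts = ["A"] * 38 + ["B"] * 9 + ["tie"] * 3
print(f"flip rate: {flip_rate(verdicts):.2f}")

# Hypothetical counts: 21 of 29 questions with an A-majority (about 72%).
print(f"position-bias p-value: {sign_test_p(21, 29):.3f}")
```

With ties counted as a third outcome, this definition can exceed 50% for a single question, which is one way a value such as 56% could arise; other definitions (for example, disagreement between consecutive trials) would give different numbers.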

Beyond within-judge instability, cross-judge agreement is only 76% ($\kappa = 0.51$), semantically equivalent prompt templates change majority outcomes in 25% of tested cases, and deterministic decoding reduces but does not eliminate inconsistency. A reliability curve analysis shows that, in our dataset, 11 repeated trials are needed for a majority vote to recover the 50-trial reference verdict with 95% probability on average, rising to 15 for high-variance questions.
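
As a rough illustration of the cross-judge agreement statistic and the reliability curve, the sketch below computes Cohen's $\kappa$ from two judges' verdicts and estimates, by resampling, how often a $k$-trial majority vote recovers the full-sample verdict. The resampling scheme (sampling without replacement from the recorded trials, odd $k$ only, averaging over questions) is an assumption, not the procedure confirmed by the abstract, and all function names are hypothetical.

```python
import random
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long label sequences."""
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from the two judges' marginal label frequencies.
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / n ** 2
    return (p_o - p_e) / (1 - p_e)  # undefined when p_e == 1

def majority(verdicts):
    """Modal verdict; equal counts resolve to the label seen first."""
    return Counter(verdicts).most_common(1)[0][0]

def recovery_prob(verdicts, k, n_resamples=2000, rng=None):
    """Estimated probability that a k-trial majority matches the full-sample majority."""
    rng = rng or random.Random(0)
    reference = majority(verdicts)
    hits = sum(majority(rng.sample(verdicts, k)) == reference
               for _ in range(n_resamples))
    return hits / n_resamples

def trials_needed(questions, target=0.95, max_k=49, rng=None):
    """Smallest odd k whose mean recovery probability reaches the target, else None."""
    rng = rng or random.Random(0)
    for k in range(1, max_k + 1, 2):
        mean_p = sum(recovery_prob(v, k, rng=rng) for v in questions) / len(questions)
        if mean_p >= target:
            return k
    return None

# Hypothetical data: three questions with 50 pairwise verdicts each.
questions = [["A"] * 45 + ["B"] * 5,
             ["A"] * 35 + ["B"] * 15,
             ["B"] * 40 + ["A"] * 10]
print("trials needed:", trials_needed(questions))
```

Restricting `k` to odd values avoids tied votes when there are only two labels; with a third "tie" label the first-seen rule in `majority` still applies.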

These findings suggest that single-trial LLM judging is often too noisy for high-stakes evaluation, and that multi-trial aggregation, position randomization, and explicit uncertainty reporting should be standard practice. Because both judges are from a single provider, cross-provider replication remains an important next step.
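
One way the recommended practice could look in code is sketched below: each trial randomizes which response is shown first, votes are mapped back to the original responses, and the result reports the vote split alongside the winner. The `judge` callable is a hypothetical stand-in for a real model call and is assumed to return `"first"` or `"second"`; the default of 11 trials simply echoes the average figure reported above.

```python
import random
from collections import Counter

def judge_pair(judge, question, resp_a, resp_b, n_trials=11, seed=0):
    """Majority vote over position-randomized trials, with the vote split reported."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_trials):
        swapped = rng.random() < 0.5  # randomize presentation order per trial
        first, second = (resp_b, resp_a) if swapped else (resp_a, resp_b)
        pick = judge(question, first, second)  # hypothetical: "first" or "second"
        if pick not in ("first", "second"):
            votes["tie"] += 1  # anything else is counted as a tie
            continue
        # Map the positional pick back to the original response label.
        picked_a = (pick == "first") != swapped
        votes["A" if picked_a else "B"] += 1
    winner, top = votes.most_common(1)[0]
    return {"winner": winner, "agreement": top / n_trials, "votes": dict(votes)}

# Stand-in judge for demonstration only: prefers the longer response.
def length_judge(question, first, second):
    return "first" if len(first) > len(second) else "second"

print(judge_pair(length_judge, "Which answer is better?", "a detailed answer", "ok"))
```

Reporting `agreement` and the raw vote counts, rather than the winner alone, makes a narrow 6-to-5 split distinguishable from a unanimous verdict.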
