Metric Match: A Subset Selection Approach to Evaluating LLM Judge Reliability

Abstract: LLM judges are used to reduce the need for costly human labor in evaluating open-ended text generation. However, the reliability of these judges depends critically on their alignment with human raters, and measuring that alignment itself requires costly human annotations.

In this work, we develop Metric Match, a method for estimating correlation-based reliability metrics of LLM judges from limited annotations. Metric Match selects a subset of samples for human annotation such that the reliability metric computed on the subset, using acquired synthetic labels, matches the same metric computed on the full population.
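
The selection idea can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `select_matching_subset`, the `n_candidates` parameter, the random-search strategy, and the use of Pearson correlation as the example metric are all assumptions made for illustration. The sketch draws many candidate subsets and keeps the one whose judge-versus-synthetic-label correlation is closest to the correlation over the full population.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


def select_matching_subset(judge_scores, synthetic_labels, k,
                           n_candidates=2000, metric=pearson, seed=0):
    """Pick k sample indices whose metric value best matches the population.

    Assumption: a plain random search over candidate subsets stands in for
    whatever search procedure Metric Match actually uses.
    """
    judge_scores = np.asarray(judge_scores, dtype=float)
    synthetic_labels = np.asarray(synthetic_labels, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(judge_scores)

    # Population-level metric, computed with synthetic labels only.
    target = metric(judge_scores, synthetic_labels)

    best_idx, best_gap = None, np.inf
    for _ in range(n_candidates):
        idx = rng.choice(n, size=k, replace=False)
        gap = abs(metric(judge_scores[idx], synthetic_labels[idx]) - target)
        # NaN gaps (e.g. a constant subset) never pass this comparison.
        if gap < best_gap:
            best_idx, best_gap = idx, gap

    if best_idx is None:
        raise ValueError("no candidate subset produced a finite metric value")
    return np.sort(best_idx), target, float(best_gap)


if __name__ == "__main__":
    # Toy data: synthetic labels and noisy judge scores (hypothetical values).
    rng = np.random.default_rng(1)
    labels = rng.normal(size=500)
    scores = 0.6 * labels + rng.normal(scale=0.8, size=500)
    idx, target, gap = select_matching_subset(scores, labels, k=30)
    print(f"population metric={target:.3f}, subset gap={gap:.5f}")
```

The returned indices would then be sent for human annotation, and the reliability metric recomputed on that subset with the human labels in place of the synthetic ones.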

We show empirically that Metric Match achieves a win rate of 0.838 against random subset selection across four correlation metrics and 15 datasets, lowers average estimation error by 18.7%, and reduces annotation needs by 32.5%.

We provide a cost model and highlight a medical case study in which our method saves $1,041.67 in expert annotation costs compared to random selection. We also recast the task from reliability estimation to reliability classification, that is, deciding whether a given judge is above a deployment threshold, and find that Metric Match again outperforms random selection.
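
The classification variant can be read as a thresholding step on the estimated metric. The sketch below is an assumption about how such a check might look, not the paper's procedure: the function names, the choice of Pearson correlation, and the `threshold` parameter are hypothetical.

```python
import numpy as np


def estimate_reliability(judge_scores, human_labels):
    """Pearson correlation between judge scores and human labels on the
    annotated subset (one of several possible correlation metrics)."""
    return float(np.corrcoef(judge_scores, human_labels)[0, 1])


def is_above_threshold(judge_scores, human_labels, threshold):
    """Classify a judge as deployable if its estimated reliability
    is at or above the given deployment threshold (assumed convention)."""
    return bool(estimate_reliability(judge_scores, human_labels) >= threshold)
```

Under this reading, the subset only needs to place the estimate on the correct side of the threshold, which is a weaker requirement than estimating the metric precisely.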

All project code is publicly available, and we additionally provide an installable package for ease of use.
