Do LLM Attribution Metrics Transfer? Auditing Retrieval-Augmented Generation Evaluation Across Datasets and Constructs
Abstract: Automatic metrics for attribution in LLM retrieval-augmented generation are often treated as interchangeable in practice. We audit eight automatic scorers (lexical, embedding, and BERTScore baselines alongside entailment/grounding-trained models: clean and FEVER NLI, and the checker MiniCheck) across three evaluation constructs (provenance/topicality, generated-answer attribution, and fact-check entailment). We ask whether any scorer transfers, that is, stays within the 95% confidence interval of the best audited scorer on every dataset of a multi-dataset construct.
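The transfer criterion above can be sketched in a few lines. This is a minimal illustration, not the audit's actual code: the function name `transfers`, the input layout, and all numbers are hypothetical, and the lower bound of the best scorer's 95% confidence interval is assumed to have been computed elsewhere (for example, by bootstrap).

```python
def transfers(scores, best_ci_lower):
    """Return, per scorer, whether it meets the transfer criterion.

    scores:        {dataset: {scorer: AUROC}}  (hypothetical layout)
    best_ci_lower: {dataset: lower bound of the best scorer's 95% CI},
                   assumed to be precomputed.
    A scorer transfers if it reaches that lower bound on every dataset.
    """
    scorer_names = set.intersection(*(set(s) for s in scores.values()))
    return {
        name: all(scores[d][name] >= best_ci_lower[d] for d in scores)
        for name in scorer_names
    }


# Toy, made-up numbers for three anonymous scorers on two datasets.
toy_scores = {
    "d1": {"A": 0.90, "B": 0.70, "C": 0.88},
    "d2": {"A": 0.55, "B": 0.91, "C": 0.89},
}
toy_best_ci_lower = {"d1": 0.86, "d2": 0.87}

transfer_result = transfers(toy_scores, toy_best_ci_lower)
```

In this toy setup only scorer `C` transfers: `A` and `B` each win one dataset but fall outside the best scorer's interval on the other.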
In the construct with the most multi-dataset human-labeled coverage, generated-answer attribution (AttributionBench's four source datasets, n = 1,610, with independent HAGRID, n = 2,150), none does. The per-dataset metric rankings invert (Kendall tau = -0.64, p = 0.031 on AttributedQA vs. LFQA), and an off-the-shelf NLI scorer that is best on short-claim AttributedQA (AUROC 0.90) collapses to chance-level AUROC 0.53 on long-form LFQA, where BERTScore wins (0.91). The flip is not a length or truncation artifact.
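To make the ranking-inversion statistic concrete, the sketch below computes Kendall's tau between two per-dataset score tables. It assumes the tie-free tau-a variant; the abstract does not say which variant was used. The function name and the toy scores are hypothetical. As a sanity check on the reported value, if all eight scorers are ranked without ties there are 28 scorer pairs, and tau = -0.64 would correspond to 23 discordant and 5 concordant pairs.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall tau-a between two {scorer: score} dicts with the same keys.

    Assumes no tied scores; tied pairs would simply count as neither
    concordant nor discordant here.
    """
    concordant = discordant = 0
    for a, b in combinations(x, 2):
        product = (x[a] - x[b]) * (y[a] - y[b])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Toy, made-up scores: the ranking on dataset 2 is the exact reverse
# of the ranking on dataset 1.
toy_d1 = {"A": 0.9, "B": 0.7, "C": 0.8, "D": 0.6}
toy_d2 = {"A": 0.5, "B": 0.8, "C": 0.6, "D": 0.9}

tau_reversed = kendall_tau(toy_d1, toy_d2)
tau_identical = kendall_tau(toy_d1, toy_d1)
```

A tau near -1 means the scorer that ranks first on one dataset tends to rank last on the other, which is the pattern the AttributedQA vs. LFQA comparison describes.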
This instability has a concrete decision cost: a naive “best-on-average” rule for choosing an evaluator fails under leave-one-dataset-out evaluation (mean held-out regret 0.172 AUROC, worse than fixing one scorer), so metric choice must be validated on the target dataset rather than learned from others.
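The leave-one-dataset-out check can be sketched as follows. The regret definition here is an assumption, since the abstract does not spell it out: for each held-out dataset, pick the scorer with the highest mean AUROC on the remaining datasets, then measure how far that pick falls below the best scorer on the held-out dataset. The function name and the toy numbers are hypothetical.

```python
from statistics import mean


def lodo_regret(scores):
    """Per-dataset regret of a "best-on-average" selection rule.

    scores: {dataset: {scorer: AUROC}}  (hypothetical layout)
    Regret (assumed definition) = best held-out AUROC minus the held-out
    AUROC of the scorer chosen from the other datasets.
    """
    datasets = list(scores)
    scorer_names = list(scores[datasets[0]])
    regrets = {}
    for held_out in datasets:
        train = [d for d in datasets if d != held_out]
        # "Best on average" over the training datasets only.
        chosen = max(
            scorer_names,
            key=lambda s: mean(scores[d][s] for d in train),
        )
        oracle = max(scores[held_out].values())
        regrets[held_out] = oracle - scores[held_out][chosen]
    return regrets


# Toy, made-up numbers for two anonymous scorers on three datasets.
toy_scores = {
    "d1": {"A": 0.9, "B": 0.6},
    "d2": {"A": 0.5, "B": 0.9},
    "d3": {"A": 0.8, "B": 0.7},
}

toy_regrets = lodo_regret(toy_scores)
mean_regret = mean(toy_regrets.values())
```

In this toy example the rule picks the wrong scorer on every held-out dataset, because the per-dataset winners disagree. That is the mechanism by which rank inversions turn into regret.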
A prompt-based LLM judge avoids the chance-level collapses that the automatic scorers suffer (no LFQA collapse), but it is not uniformly best, costs roughly 100x more, and is non-deterministic. It relocates the validation burden rather than removing it.