Do LLM Attribution Metrics Transfer? Auditing Retrieval-Augmented Generation Evaluation Across Datasets and Constructs

Abstract: In practice, automatic metrics for attribution in LLM retrieval-augmented generation (RAG) are often treated as interchangeable. We audit eight automatic scorers (lexical, embedding, and BERTScore baselines, plus entailment/grounding-trained models: clean and FEVER NLI, and the MiniCheck checker) across three evaluation constructs: provenance/topicality, generated-answer attribution, and fact-check entailment. We ask whether any scorer transfers, meaning that it stays within the 95% confidence interval of the best audited scorer on every dataset of a multi-dataset construct.
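
The transfer criterion above can be sketched in a few lines. This is a minimal illustration, not the audit's actual code: it assumes "within the 95% confidence interval of the best scorer" means that a scorer's point AUROC is at least the lower bound of a percentile-bootstrap interval around the best scorer's AUROC. The function names (`auroc`, `bootstrap_ci`, `transfers`) and the input layout are hypothetical.

```python
import random


def auroc(labels, scores):
    """AUROC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for AUROC (assumed CI method)."""
    if len(set(labels)) < 2:
        raise ValueError("AUROC needs at least one positive and one negative label")
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # skip resamples that contain only one class
        stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


def transfers(scorer, datasets):
    """datasets: {dataset_name: (labels, {scorer_name: scores})} (hypothetical layout).

    Returns True only if `scorer` reaches the lower CI bound of the
    best scorer on every dataset.
    """
    for labels, by_scorer in datasets.values():
        aurocs = {name: auroc(labels, s) for name, s in by_scorer.items()}
        best = max(aurocs, key=aurocs.get)
        lo, _ = bootstrap_ci(labels, by_scorer[best])
        if aurocs[scorer] < lo:
            return False
    return True
```

Under this reading, a single dataset on which a scorer falls below the best scorer's interval is enough to rule out transfer, which is why the criterion is strict for multi-dataset constructs.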

In the construct with the most multi-dataset human-labeled coverage, generated-answer attribution (AttributionBench’s four source datasets, n = 1,610, plus the independent HAGRID, n = 2,150), no scorer transfers. The per-dataset metric rankings invert (Kendall tau = -0.64, p = 0.031, AttributedQA vs. LFQA), and an off-the-shelf NLI scorer that is best on short-claim AttributedQA (AUROC 0.90) collapses to a chance-level AUROC of 0.53 on long-form LFQA, where BERTScore wins (0.91). The flip is not a length or truncation artifact.
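
To make the rank-inversion statistic concrete, the sketch below computes Kendall's tau between two per-dataset AUROC vectors over the same scorers. It is an illustration only: the AUROC values in the usage lines are made-up placeholders, not the audited results, and the function implements the simple tau-a variant, which may differ from the variant used in the audit. A p-value would need an additional permutation test, or a library routine such as `scipy.stats.kendalltau`.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) / total pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Placeholder AUROCs for four hypothetical scorers on two datasets.
dataset_a = [0.90, 0.80, 0.70, 0.60]
dataset_b = [0.50, 0.60, 0.90, 0.80]
tau = kendall_tau(dataset_a, dataset_b)  # negative: the rankings largely invert
```

A tau near +1 means the two datasets rank the scorers the same way, and a clearly negative tau means a scorer that looks strong on one dataset tends to look weak on the other.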

This instability has a concrete decision cost: a naive “best-on-average” rule for choosing an evaluator fails under leave-one-dataset-out evaluation (mean held-out regret of 0.172 AUROC, worse than fixing a single scorer). Metric choice must therefore be validated on the target dataset rather than learned from other datasets.
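
The leave-one-dataset-out check can be sketched as follows. This is a minimal reading of the "best-on-average" rule, assuming regret is defined as the held-out dataset's best AUROC minus the AUROC of the scorer chosen from the remaining datasets. The function name `lodo_regret` and the table in the usage lines are hypothetical, and the numbers are placeholders rather than audited results.

```python
def lodo_regret(auroc_table):
    """auroc_table: {dataset: {scorer: auroc}} (hypothetical layout).

    For each held-out dataset, pick the scorer with the highest mean AUROC
    on the other datasets, then report how far it falls short of the
    held-out dataset's best scorer.
    """
    if len(auroc_table) < 2:
        raise ValueError("need at least two datasets")
    regrets = {}
    for held_out, held_scores in auroc_table.items():
        train = [d for d in auroc_table if d != held_out]
        mean_train = {
            scorer: sum(auroc_table[d][scorer] for d in train) / len(train)
            for scorer in held_scores
        }
        chosen = max(mean_train, key=mean_train.get)  # best-on-average pick
        regrets[held_out] = max(held_scores.values()) - held_scores[chosen]
    return regrets


# Placeholder table: two scorers whose ranking flips on the third dataset.
table = {
    "d1": {"scorer_a": 0.90, "scorer_b": 0.60},
    "d2": {"scorer_a": 0.80, "scorer_b": 0.70},
    "d3": {"scorer_a": 0.50, "scorer_b": 0.90},
}
regrets = lodo_regret(table)
mean_regret = sum(regrets.values()) / len(regrets)
```

When rankings flip between datasets, the scorer that looks best on the training datasets can be among the worst on the held-out one, so the mean regret stays well above zero.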

A prompt-based LLM judge avoids the chance-level collapses that the automatic scorers suffer (no LFQA collapse), but it is not uniformly best, costs roughly 100x more, and is non-deterministic. It therefore relocates the validation burden rather than removing it.
