Putting HUMANS first: Efficient LAM Evaluation with Human Preference Alignment
The rapid proliferation of large audio models (LAMs) demands efficient approaches for model comparison, yet comprehensive benchmarks are costly. To fill this gap, we investigate whether minimal subsets can reliably evaluate LAMs while reducing costs and data redundancy.
Analyzing 10 subset selection methods with 18 audio models across 40 tasks covering major LAM evaluation dimensions, we show that subsets of just 50 examples (0.3% of the data) can achieve over 0.93 Pearson correlation with full benchmark scores.
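For intuition, here is a minimal sketch of this subset-validation step: score a pool of models on a candidate 50-example subset and check how well those subset scores track the full-benchmark scores. The synthetic score matrix, the data sizes, and the random-sampling stand-in for a selection method are illustrative assumptions, not the paper's actual pipeline.

```python
# Sketch: does a 50-example subset reproduce full-benchmark model rankings?
# All data below is synthetic; a real selection method replaces the random pick.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

# Hypothetical per-example accuracy matrix: 18 models x 16,000 benchmark items.
n_models, n_items = 18, 16_000
scores = rng.random((n_models, n_items))

full_benchmark = scores.mean(axis=1)          # full-benchmark score per model

# Stand-in for a subset selection method: plain random sampling of 50 items.
subset_idx = rng.choice(n_items, size=50, replace=False)
subset_scores = scores[:, subset_idx].mean(axis=1)

# How faithfully does the 50-example subset score the 18 models?
r, _ = pearsonr(subset_scores, full_benchmark)
print(f"Pearson r between subset and full benchmark: {r:.3f}")
```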
To understand how well these scores align with what practitioners ultimately care about (user satisfaction), we collect 776 human preference ratings from realistic voice assistant conversations, finding that both the subsets and the full benchmark achieve only 0.85 correlation with human preferences.
To better predict preferences, we train regression models on these selected subsets, achieving 0.98 correlation and outperforming regression models trained on both random subsets and the full benchmark. This demonstrates that, for regression modeling, well-curated subsets predict preferences better than the full benchmark: quality over quantity.
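One way to realize this regression-weighting idea, sketched under assumptions: fit a regression from per-example subset scores to human preference ratings, evaluated leave-one-model-out so each model's preference score is predicted by a model it never trained on. The synthetic data, the ridge regularizer, and all shapes are placeholders; the learned coefficients illustrate the role of the per-example weights.

```python
# Sketch: learn per-example weights on a curated subset so that the weighted
# subset score predicts human preference. Data and regressor are assumptions.
import numpy as np
from sklearn.linear_model import Ridge
from scipy.stats import pearsonr

rng = np.random.default_rng(1)

n_models, subset_size = 18, 50
X = rng.random((n_models, subset_size))       # per-example scores on the subset
# Synthetic preference ratings: a noisy weighted sum of per-example scores.
human_pref = X @ rng.random(subset_size) + 0.1 * rng.standard_normal(n_models)

# Leave-one-model-out: predict each held-out model's preference score.
preds = np.empty(n_models)
for i in range(n_models):
    mask = np.arange(n_models) != i
    reg = Ridge(alpha=1.0).fit(X[mask], human_pref[mask])
    preds[i] = reg.predict(X[i:i + 1])[0]

r, _ = pearsonr(preds, human_pref)
print(f"Pearson r between predicted and human preference: {r:.3f}")
# reg.coef_ holds the per-example weights: a "regression-weighted subset".
```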
We open-source these regression-weighted subsets as the HUMANS benchmark, an efficient proxy for LAM evaluation that captures both benchmark performance and user preferences.