Adding Benchmaxxer Repellant to the Open ASR Leaderboard

“When a measure becomes a target, it ceases to be a good measure.” (Goodhart’s Law)

TLDR: Appen Inc. and DataoceanAI have provided high-quality English ASR datasets covering scripted and conversational speech across multiple accents. To reduce the risk of benchmaxxing and test-set contamination, we are keeping these datasets private, giving us a trustworthy measure of performance on multiple tasks. We’re not updating the average WER at this time: by default, the leaderboard’s Average WER is still computed on public datasets only. You can optionally include the private datasets using the toggle to see their impact 👀

Since its launch in September 2023, the Open ASR Leaderboard has been visited over 710K times. We’re blown away by the community’s interest and motivation to keep pushing speech recognition 🗣️

Two words sum up the objectives (but also challenges) in maintaining a benchmark like the Open ASR Leaderboard:

Standardization: models can follow different conventions for their usage and outputs, e.g. with/without punctuation and casing. Datasets face the same challenges and can be structured differently. To this end, all test sets have been gathered into a single dataset on the Hub for easy access and previewing. Moreover, to standardize model outputs and dataset transcripts, we use a normalizer that (among other things) removes punctuation and casing, and maps to American spelling. It is based on Whisper’s normalizer; a minimal example is shown after these two points.

Openness: the UI code and evaluation scripts are open-sourced. This has helped not only to incorporate new models, but also to improve the quality of the evaluation procedure through community feedback and contributions.
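To make the normalization step concrete, here is a minimal sketch, assuming the openai-whisper and jiwer packages are installed. The transcript strings are hypothetical; the leaderboard’s actual evaluation scripts live in the open-source repository and may differ in detail.

```python
from whisper.normalizers import EnglishTextNormalizer  # Whisper's English normalizer
import jiwer  # standard package for computing word error rate

normalizer = EnglishTextNormalizer()

# A reference transcript and a (hypothetical) model output that differ only
# in punctuation, casing, and British vs. American spelling.
reference = "The colour of the sky, she said, is BLUE."
prediction = "the color of the sky she said is blue"

# Normalization removes punctuation and casing and maps to American spelling,
# so the WER reflects recognition quality rather than formatting conventions.
norm_ref = normalizer(reference)    # "the color of the sky she said is blue"
norm_pred = normalizer(prediction)  # "the color of the sky she said is blue"

print(f"WER: {jiwer.wer(norm_ref, norm_pred):.2f}")  # 0.00 after normalization
```

Without normalization, the same pair would register several spurious word errors purely from punctuation and spelling conventions.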

Standardization and openness are essential for meaningful benchmarking, but they also make benchmarks more susceptible to benchmark-specific optimization (“benchmaxxing”), where models improve leaderboard performance without corresponding gains in real-world robustness. As models and use cases evolve, the Open ASR Leaderboard will continue incorporating high-quality datasets and new evaluation settings to better reflect real-world performance and improve robustness against benchmark-specific optimization.

As discussed in our report, there is no single “catch-all” ASR model: some perform better on American English, others on diverse accents and multilingual settings, while others are optimized for speed or conversational audio. Different applications also prioritize different capabilities, so a model that performs less well on one dimension is not necessarily a worse model overall. The goal of the Open ASR Leaderboard is to capture these nuances and provide a more holistic view of ASR performance.

New high-quality, private datasets

To this end, we have worked with Appen Inc. and DataoceanAI to curate high-quality datasets for ASR benchmarking. Below is some information on the various splits.

(Table omitted for brevity)

While private datasets may sound contrary to the spirit of openness, we believe that incorporating them will increase the trustworthiness of the Open ASR Leaderboard, as they are far harder to exploit for benchmaxxing, whether by model developers who train on the public test sets directly, or by those who seek out training data that closely resembles a particular dataset to boost their score in the macroaverage. With these datasets, we can also provide targeted metrics to highlight gaps and biases between controlled and often saturated settings (scripted, American accent) and more nuanced conditions (conversational and non-American accents).

Below is a screenshot of the new “Private data” tab. Each column is computed as follows:

- “Average WER”: a macroaverage of the per-provider averages, so that each data provider is weighted equally.
- “Avg Scripted”: a macroaverage over all scripted datasets.
- “Avg Conversational”: a macroaverage over all conversational datasets.
- “Avg US”: a macroaverage over all datasets with American accents.
- “Avg non-US”: a macroaverage over all datasets with non-American accents.

We intentionally do not report a score for each individual split, to prevent model developers from boosting their score by targeting a specific data provider or accent. A sketch of the macroaverage computation follows.
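As a concrete illustration, the sketch below shows how a macroaverage weights each data provider equally, regardless of how many splits each contributes. The provider names and WER values are hypothetical; the real column logic lives in the leaderboard’s open-source evaluation code.

```python
# Hypothetical per-split WERs for one model, grouped by (fictional) data provider.
wers_by_provider = {
    "provider_a": [6.1, 7.4, 5.9],  # e.g. three splits from one provider
    "provider_b": [8.2, 9.0],       # two splits from another
}

# Macroaverage: first average within each provider, then average those means,
# so every provider contributes equally to the final "Average WER" column.
provider_means = [sum(v) / len(v) for v in wers_by_provider.values()]
average_wer = sum(provider_means) / len(provider_means)

print(f"Average WER (macroaverage): {average_wer:.2f}")
```

The same pattern applies to the other columns, restricted to the relevant subset of datasets (scripted, conversational, US, or non-US).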

How can I evaluate my model on this data?

Get your model on the Open ASR Leaderboard, and we’ll run the evaluation!