Introducing the FFASR Leaderboard: Benchmarking ASR in the Real World

🚀 The first open far-field ASR benchmark: community-driven evaluation across 14 simulated rooms, validated against real-world measurements. Explore it at https://huggingface.co/spaces/treble-technologies/ffasr

📉 The gap is real and it is large: across all submitted models, far-field word error rate (WER) at low signal-to-noise ratio (SNR) is consistently several times higher than near-field WER on the same speech content.

🔬 Methodology you can trust: hybrid wave-based simulation, sim-to-real validation, moving-source splits (in beta), held-out audio, and standardized evaluation hardware across all submissions.

⚡ Accuracy and speed together: the Pareto front plots average WER against RTFx so you can choose the tradeoff that is right for your deployment.

👀 More is coming: multi-talker scenarios, microphone array support, and echo cancellation are on the roadmap.
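
To make the accuracy/speed tradeoff concrete, here is a minimal sketch of how a Pareto front over (WER, RTFx) pairs can be computed. It assumes RTFx is the inverse real-time factor (seconds of audio processed per second of compute, so higher is faster). The model names and numbers are invented for illustration and are not leaderboard results.

```python
from __future__ import annotations

from typing import NamedTuple


class Entry(NamedTuple):
    model: str    # hypothetical model name
    wer: float    # average WER in percent, lower is better
    rtfx: float   # assumed: audio seconds per compute second, higher is better


def pareto_front(entries: list[Entry]) -> list[Entry]:
    """Return the entries that no other entry beats on both WER and RTFx."""
    # Sort fastest first; on equal speed, the more accurate entry comes first.
    ordered = sorted(entries, key=lambda e: (-e.rtfx, e.wer))
    front: list[Entry] = []
    best_wer = float("inf")
    for e in ordered:
        # Everything seen so far is at least as fast as e, so e stays on the
        # front only if it is strictly more accurate than all of them.
        if e.wer < best_wer:
            front.append(e)
            best_wer = e.wer
    return front


# Made-up example values, not real submissions.
entries = [
    Entry("model-a", wer=12.0, rtfx=50.0),
    Entry("model-b", wer=9.0, rtfx=20.0),
    Entry("model-c", wer=15.0, rtfx=30.0),  # slower and less accurate than model-a
    Entry("model-d", wer=8.0, rtfx=5.0),
]
front = pareto_front(entries)
```

In this toy data, `model-c` drops out because `model-a` is both faster and more accurate; the remaining three each represent a different point on the tradeoff curve.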

The gap between benchmark performance and real-world deployment is one of the more persistent frustrations in ASR development. Models that score well on standard evaluations often behave differently once real room acoustics are involved: reverberation, background noise, and microphone distance. These factors interact in complex ways, and clean-speech benchmarks do not capture their effect on performance. The FFASR Leaderboard is our attempt to quantify that gap.

Treble Technologies and Hugging Face are launching the Far-Field ASR (FFASR) Leaderboard, the first open, community-driven benchmark designed to evaluate ASR models under realistic far-field acoustic conditions. It is live now, and we are inviting the community to submit models, explore the results, and help shape what comes next.

Why far-field evaluation matters

Voice interfaces have expanded well beyond the headset and the smartphone. AI voice agents, conference room transcription, in-car assistants, humanoid robots, smart glasses, and hands-free tools are all seeing rapid adoption. What they have in common is that they operate in acoustically complex environments: reverberation, background noise, overlapping sounds, and a microphone that may be anywhere from one to several meters from the speaker.

The dominant ASR evaluation paradigm has not caught up with this reality. Clean, close-microphone benchmarks remain the standard, and while they are useful for measuring core recognition quality, they do not predict far-field performance. A model that performs well on LibriSpeech or other near-field sets may degrade substantially once real room acoustics enter the picture.

There have been several research efforts around far-field and noisy speech evaluation, including CHiME, URGENT, and NOIZEUS. Even so, the community has not had a standardized, open way to measure that degradation consistently across models in a continuously updated leaderboard format. That is what FFASR is built for.

A major challenge of far-field evaluation is the availability of data. Collecting far-field recordings at scale across a representative range of room types, microphone distances, and noise conditions is prohibitively expensive with physical measurements alone. Simulation makes it possible to cover that space systematically and to extend coverage over time without a corresponding increase in measurement cost.

Another goal of FFASR is to encourage the development of models that are explicitly robust to these conditions. Leaderboards have historically been effective at directing research effort. By making far-field performance visible and comparable, we hope to raise the priority of real-world acoustic robustness across the field.

How the benchmark is constructed

The FFASR Leaderboard evaluates models across nine conditions. As of 22 June 2026, the four that determine the primary ranking score are:

  • Near-field (dry): clean speech recorded in an anechoic chamber (similar to LibriSpeech, but with minimal reverberation)
  • Far-field high SNR (above 14 dB)
  • Far-field mid SNR (8 to 12 dB)
  • Far-field low SNR (below 6 dB)
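
As a rough illustration of how these tiers and a summary score could fit together, the sketch below maps an SNR value to a tier using the thresholds listed above, counts word errors with a standard word-level edit distance, and averages per-condition WER. The condition keys (`near_dry`, `far_high`, and so on) and the unweighted mean are assumptions made for this example; the leaderboard's actual scoring formula is not specified here.

```python
from __future__ import annotations


def snr_tier(snr_db: float) -> str | None:
    """Map a far-field SNR in dB to a tier, using the thresholds listed above."""
    if snr_db > 14:
        return "far_high"
    if 8 <= snr_db <= 12:
        return "far_mid"
    if snr_db < 6:
        return "far_low"
    # Values between 6 and 8 dB or between 12 and 14 dB fall outside the tiers.
    return None


def word_errors(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    # Single-row Levenshtein distance computed over words instead of characters.
    row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev_diag, row[0] = row[0], i
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            prev_diag, row[j] = row[j], min(
                row[j] + 1,        # deletion
                row[j - 1] + 1,    # insertion
                prev_diag + cost,  # substitution or match
            )
    return row[-1], len(ref)


def primary_score(wer_by_condition: dict[str, float]) -> float:
    """Assumed aggregation: unweighted mean WER over the four primary conditions."""
    primary = ("near_dry", "far_high", "far_mid", "far_low")
    return sum(wer_by_condition[c] for c in primary) / len(primary)
```

WER for a condition is then the total edit distance divided by the total number of reference words across its utterances. Real evaluations usually normalize text (casing, punctuation, numerals) before scoring, which this sketch leaves out.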

To give a sense of what these conditions actually sound like, the samples below let you hear the same utterance as dry anechoic audio, then convolved with a room impulse response, and finally with noise added at each SNR tier. The difference between the dry recording and the low-SNR far-field condition is a reasonable proxy for the scale of the problem the leaderboard is measuring.
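
The processing chain described above (dry audio, convolution with a room impulse response, then added noise) can be sketched in a few lines. This is a simplified illustration using only the Python standard library: the "speech" is a sine tone, the impulse response is a toy direct path with two echoes, and the noise is mixed in directly as white noise, which may differ from how the benchmark renders its noise. In practice you would use an array library and an FFT-based convolution.

```python
import math
import random


def convolve(signal, impulse_response):
    """Direct linear convolution; output length is len(signal) + len(ir) - 1."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def power(x):
    """Mean squared amplitude of a signal."""
    return sum(v * v for v in x) / len(x)


def add_noise_at_snr(speech, noise, snr_db):
    """Scale noise so 10*log10(P_speech / P_noise) equals snr_db, then mix."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[: len(speech)]
    gain = math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


# Toy inputs, all hypothetical: 0.1 s of a 220 Hz tone at 16 kHz.
rng = random.Random(0)
dry = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(1600)]
rir = [1.0] + [0.0] * 79 + [0.5] + [0.0] * 79 + [0.25]  # direct path + two echoes

reverberant = convolve(dry, rir)
noise = [rng.gauss(0.0, 1.0) for _ in reverberant]
noisy = add_noise_at_snr(reverberant, noise, snr_db=5.0)  # inside the low-SNR tier
```

Note that the SNR here is defined relative to the reverberant speech, not the dry signal. Which reference the benchmark uses is not stated, so treat that choice as an assumption of the sketch.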

Two additional columns, Lab Measured and Lab Simulated, serve as a sim-to-real validation track. The leaderboard also includes moving-source splits, currently in beta, which evaluate models on audio where the speaker is in motion rather than stationary. This condition reflects use cases such as humanoid robots, in-car speech, and mobile voice assistants, where the acoustic geometry between speaker and microphone changes continuously.

The acoustic data is generated with Treble’s hybrid simulation engine, which combines a wave-based solver at low to mid frequencies with geometrical-acoustics modeling at higher frequencies. This approach captures physical phenomena that simpler simulation methods often miss: diffraction, scattering, interference, and modal behavior. The result is simulated data that closely matches measured acoustic conditions; the Lab Measured and Lab Simulated columns check this directly by running the same evaluation on both.
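
To illustrate the general idea of a hybrid response, the sketch below merges two impulse responses with a crossover: the low band is taken from a (hypothetical) wave-based result and the high band from a (hypothetical) geometrical-acoustics result. This is not Treble's implementation. The first-order filter, the function names, and the crossover frequency are all assumptions chosen to keep the example short; production engines use far more careful crossover design and calibration.

```python
import math


def one_pole_lowpass(x, cutoff_hz, sample_rate):
    """First-order IIR low-pass: y[n] = y[n-1] + a * (x[n] - y[n-1])."""
    a = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    y, state = [], 0.0
    for v in x:
        state += a * (v - state)
        y.append(state)
    return y


def combine_hybrid(rir_wave, rir_ga, crossover_hz, sample_rate):
    """Low band from the wave-based response, high band from the geometrical one."""
    # Zero-pad both responses to a common length.
    n = max(len(rir_wave), len(rir_ga))
    wave = list(rir_wave) + [0.0] * (n - len(rir_wave))
    ga = list(rir_ga) + [0.0] * (n - len(rir_ga))

    low = one_pole_lowpass(wave, crossover_hz, sample_rate)
    ga_low = one_pole_lowpass(ga, crossover_hz, sample_rate)
    # The high band is the complement of the low band (input minus low-pass),
    # so the two bands of an identical input sum back to that input.
    return [l + (g - gl) for l, g, gl in zip(low, ga, ga_low)]
```

A useful sanity check for any crossover of this kind: if both solvers returned the same response, the combined result should reproduce it.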

Fourteen fully furnished rooms are included in the benchmark, ranging from 20 to 470 m³ and covering bathrooms, living rooms with hallways, offices, classrooms, and restaurant spaces. Each acoustic scene contains one target speaker, recorded in an anechoic chamber to avoid reverberation artifacts from the recording environment, and up to three no…