Can Voice Agents Handle Bilingual Customers? Benchmarking Frontier ASR on Code-Switched Speech

Introduction

Over half of the world’s population speaks more than one language. For many bilingual speakers, code-switching (moving seamlessly between languages, even mid-sentence) is a natural part of everyday communication. Whether in casual conversation, contact centers, or IT helpdesks, speakers shift to whichever language feels most natural in the moment.

Despite the prevalence of bilingual speakers, little work has examined how voice agents handle code-switched speech in enterprise settings. So when a customer asked how our voice agents would perform for their largely bilingual customer base, whose members routinely code-switch, we decided to build our own benchmark and dataset to evaluate models.

We focused on automatic speech recognition (ASR), the first step in any voice agent pipeline, because transcription errors propagate into every downstream component. In enterprise settings, where a misrouted ticket or a misunderstood policy question has real operational consequences, getting the transcript right is especially important.

Our benchmark covers the four language pairs most relevant to our customer base: Spanish-English, French-English, Canadian French-English, and German-English. In each pair, the non-English language is the matrix language (the one that supplies the grammatical frame of the utterance), and English segments of varying lengths are embedded in it. The data covers a wide range of Human Resources (HR) and IT Service Management (ITSM) scenarios, including employee inquiries about benefits or payroll and support requests such as password resets, VPN access, or device troubleshooting.

To measure how the models perform, we report three metrics: Word Error Rate (WER), Semantic Word Error Rate (SWER), and Answer Error Rate (AER). We chose these metrics to capture both (1) the models’ exact transcription accuracy and (2) their ability to preserve the meaning of the utterance for downstream tasks.

We release our benchmark and data through AU-Harness, our harness for evaluating voice models. We also provide results from seven ASR systems, including Large Audio Language Models (LALMs), frontier ASRs, and open-source ASRs. Our main finding is that the cost of code-switching varies with the language pair and the model tested. ElevenLabs Scribe V2, Gemini 3 Flash, and AssemblyAI Universal 3-Pro emerge as the top models across metrics for this task.

The Benchmark Data Pipeline

We start with an internal corpus of IT support and HR interactions. To create each code-switched utterance, we begin with parallel user utterances in English and one of our four non-English languages, then filter for good code-switching candidates. We keep utterances between 12 and 40 words: short enough to be natural spoken turns, long enough to contain real switching opportunities.

We also exclude utterances dominated by entities such as emails, phone numbers, IDs, or URLs, which make the text half-English by necessity rather than by bilingual choice. Finally, we require at least three switchable content words (nouns, verbs, or adjectives that are not entities or product names) so that the generation model has enough material to produce a meaningful code-switched version.
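The three filters above can be sketched as a single predicate. This is a minimal illustration, not the benchmark's actual implementation: the `Token` type, the regular expressions in `ENTITY_PATTERNS`, and the `max_entity_ratio` threshold are all assumptions (the text says only that entities should not "dominate"), and part-of-speech tags and entity flags are assumed to come from some upstream tagger.

```python
import re
from dataclasses import dataclass

# Hypothetical surface patterns for entity-like tokens; not from the source.
ENTITY_PATTERNS = [
    re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$"),             # email address
    re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE),  # URL
    re.compile(r"^\+?\d[\d\-().]{6,}$"),                   # phone number
    re.compile(r"^[A-Za-z]{0,5}[-_]?\d{3,}$"),             # ID such as INC0012345
]
CONTENT_POS = {"NOUN", "VERB", "ADJ"}


@dataclass
class Token:
    text: str
    pos: str                 # coarse POS tag from an upstream tagger (assumed)
    is_entity: bool = False  # named entity or product name (assumed upstream flag)


def looks_like_entity(token: Token) -> bool:
    """True if the token is flagged as an entity or matches an entity pattern."""
    return token.is_entity or any(p.match(token.text) for p in ENTITY_PATTERNS)


def is_code_switch_candidate(tokens: list[Token], max_entity_ratio: float = 0.3) -> bool:
    """Apply the length, entity-dominance, and switchable-word filters in order."""
    n = len(tokens)
    if not 12 <= n <= 40:                      # keep 12 to 40 words
        return False
    entity_count = sum(looks_like_entity(t) for t in tokens)
    if entity_count / n > max_entity_ratio:    # assumed meaning of "entities dominate"
        return False
    switchable = sum(
        t.pos in CONTENT_POS and not looks_like_entity(t) for t in tokens
    )
    return switchable >= 3                     # at least three switchable content words
```

The sketch treats each token as one word, so a tagger that splits punctuation into separate tokens would need those filtered out before the length check.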

From here, we tested several strategies for combining languages realistically and ultimately selected a simple persona prompt sent to an LLM (OpenAI/GPT-5) to produce the code-switched text. We then used an LLM verbalization pass to convert the text into its spoken form and ElevenLabs Multilingual V2 to synthesize the audio. Every utterance is reviewed by an AI/NLP linguist who is a native speaker of the matrix language; flagged utterances are either excluded or regenerated and re-reviewed. The final dataset has 259 Spanish-English records, 298 French-English records, 188 Canadian French-English records, and 173 German-English records, for 918 records in total.

Evaluation Methodology

We report three metrics per model per language pair, chosen to capture transcription accuracy, meaning preservation, and downstream task performance:

  • Word Error Rate (WER). Along with overall WER per language pair, we report WER by individual language.
  • Semantic WER (SWER). This score is the rate of errors that are judged to be semantically meaningful. Our implementation is largely based on Pipecat’s STT benchmark, and we use Gemma-4-31B as our judge.
  • Answer Error Rate (AER). This metric directly captures whether transcription errors propagate into downstream failures. It is a question-answering metric that follows the methodology in Bhushan et al. (IISc/ARTPARK, arXiv 2507.16456). For each utterance, we generate three downstream comprehension questions and measure whether an LLM reading the ASR transcript can answer them correctly.
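As a rough illustration of the first and last metrics, the sketch below computes WER as word-level edit distance divided by reference length, and AER as the fraction of comprehension questions answered incorrectly. Several details are assumptions rather than the benchmark's actual implementation: the `tokenize` normalization (lowercasing, dropping punctuation), the function names, and the aggregation of AER over all questions. The LLM calls that generate and grade the questions are omitted; `answer_error_rate` only aggregates their verdicts.

```python
import re


def tokenize(text: str) -> list[str]:
    """Assumed normalization: lowercase, keep word characters and inner apostrophes."""
    return re.findall(r"\w+(?:['’]\w+)*", text.lower())


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word errors divided by the number of reference words."""
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    if not ref:
        raise ValueError("reference transcript is empty")
    return edit_distance(ref, hyp) / len(ref)


def answer_error_rate(judgments: list[list[bool]]) -> float:
    """judgments[i][k] is True if question k about utterance i was answered correctly."""
    total = sum(len(per_utterance) for per_utterance in judgments)
    if total == 0:
        raise ValueError("no judgments provided")
    wrong = sum(not ok for per_utterance in judgments for ok in per_utterance)
    return wrong / total
```

For example, `wer("necesito resetear mi password", "necesito resetear mi pasword")` returns 0.25: one substitution against four reference words. That single misspelling costs as much as a completely wrong word would, which is the gap the semantic metrics are meant to cover.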

Findings

We evaluated the following models: AssemblyAI / Universal 3-Pro, Deepgram / Nova 3 Multilang, ElevenLabs / Scribe V2, Google / Gemini 3 Flash, Mistral AI / Voxtral Small 24B-2507, Nvidia / Parakeet TDT 0.6b V3, and OpenAI / Whisper Large V3 Turbo.

A. How well do models perform on our code-switching benchmark?

We analyzed errors along two dimensions:

  • Word-level accuracy, measured through WER. WER is the standard approach: it aligns the ground-truth transcript with the model’s output and quantifies the distance between them. Although it is simple and widely used, it weighs every mismatch equally, so it cannot distinguish a minor spelling difference from a completely wrong word.
  • Semantic accuracy, captured through SWER and AER. SWER gives us a holistic view of utterance-level performance, though it reflects a judge model’s assessment rather than a direct downstream test. AER, by contrast, is a functional test: for each utterance,