Benchmarking Commercial ASR Systems on Code-Switching Speech: Arabic, Persian, and German

Abstract: Code-switching — the natural alternation between two languages within a single utterance — represents one of the most challenging and under-studied conditions for automatic speech recognition (ASR). Existing commercial ASR benchmarks predominantly evaluate clean, monolingual audio and report a single Word Error Rate (WER) figure that tells practitioners little about real-world multilingual performance.

We present a benchmark evaluating five commercial ASR providers across four language pairs: Egyptian Arabic–English, Saudi Arabic (Najdi/Hijazi)–English, Persian (Farsi)–English, and German–English. Each dataset comprises 300 samples selected by a two-stage pipeline: a heuristic filter first scores transcripts on five structural code-switching signals, and an ensemble of GPT-4o and Gemini 1.5 Pro then scores the remaining candidates across six linguistic dimensions. Because only the filtered candidates reach the LLM stage, this pipeline reduces LLM scoring costs by approximately 91% relative to exhaustive scoring.
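
As a rough illustration of the two-stage selection idea, the sketch below ranks transcripts with a cheap heuristic and passes only a shortlist to an expensive scorer. Everything in it is an assumption made for illustration: the single script-switch signal stands in for the five signals, which the abstract does not enumerate; `expensive_scorer` is a hypothetical placeholder for the LLM ensemble; and `keep_fraction` is a free parameter, not a value reported here. A script-based signal would also not apply to German–English, where both languages use Latin script.

```python
import re

LATIN = re.compile(r"[A-Za-z]")
ARABIC = re.compile(r"[\u0600-\u06FF]")  # Arabic block, also covers Persian letters


def heuristic_score(transcript: str) -> int:
    """Count script changes between adjacent tokens (one hypothetical signal)."""
    scripts = []
    for token in transcript.split():
        if LATIN.search(token):
            scripts.append("latin")
        elif ARABIC.search(token):
            scripts.append("arabic")
    return sum(a != b for a, b in zip(scripts, scripts[1:]))


def select(transcripts, expensive_scorer, keep_fraction, n_final):
    """Stage 1: cheap ranking. Stage 2: expensive scoring of the shortlist only."""
    ranked = sorted(transcripts, key=heuristic_score, reverse=True)
    n_keep = max(n_final, round(len(ranked) * keep_fraction))
    shortlist = ranked[:n_keep]
    rescored = sorted(shortlist, key=expensive_scorer, reverse=True)
    # Fraction of expensive calls avoided relative to scoring every transcript.
    saved = 1 - len(shortlist) / len(transcripts)
    return rescored[:n_final], saved
```

Under these assumptions, the fraction of expensive calls avoided is simply one minus the share of transcripts that survive the heuristic stage, so a shortlist of about 9% of the pool would correspond to a saving of about 91%.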

We evaluate the systems on both WER and BERTScore, arguing that BERTScore is the more reliable metric for the Arabic and Persian pairs, where transliteration variance causes WER to penalise semantically correct transcriptions. ElevenLabs Scribe v2 achieves the lowest WER on all four language pairs (13.2% overall; 13.1% on Egyptian Arabic) and leads on BERTScore (0.936 overall).
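
To make the transliteration issue concrete, the following minimal sketch computes word-level WER as edit distance divided by reference length. It is a textbook formulation, not necessarily the exact scoring script used in this benchmark, and the example sentence is invented for illustration. It does not compute BERTScore.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Invented example: the English word is written in Latin script in the
# reference but transliterated into Arabic script in the hypothesis.
reference = "أنا رايح meeting بكرة"
hypothesis = "أنا رايح ميتنج بكرة"
transliteration_wer = wer(reference, hypothesis)  # one substitution in four words
```

Here the hypothesis conveys the same content as the reference, yet exact string matching counts the transliterated word as a substitution. An embedding-based metric compares meaning rather than surface form, which is the motivation given above for reporting BERTScore alongside WER.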

We further show that difficulty-stratified analysis reveals performance gaps masked by aggregate averages, and that BERT embedding projections confirm semantic proximity between reference and hypothesis despite surface-level script differences. The benchmarking dataset is publicly available.
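
The point about aggregate averages can be sketched with a small grouping routine. The difficulty labels and WER values below are toy numbers chosen purely for illustration; they are not results from this benchmark, and how difficulty is assigned is not specified in the abstract.

```python
from collections import defaultdict
from statistics import mean


def stratified_wer(rows):
    """Map each difficulty label to the mean WER of its samples.

    rows: iterable of (difficulty_label, wer) pairs.
    """
    buckets = defaultdict(list)
    for difficulty, sample_wer in rows:
        buckets[difficulty].append(sample_wer)
    return {label: mean(values) for label, values in buckets.items()}


# Toy values only (assumed for illustration, not measured).
toy_rows = [("easy", 0.05), ("easy", 0.07), ("hard", 0.30), ("hard", 0.34)]
overall = mean(sample_wer for _, sample_wer in toy_rows)
by_level = stratified_wer(toy_rows)
```

With these toy numbers the overall mean sits between the two strata and describes neither of them well, which is the kind of gap a single aggregate figure would hide.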
