The TTS-STT Flywheel: Synthetic Entity-Dense Audio Closes the Indic ASR Gap Where Commercial and Open-Source Systems Fail

Abstract: Niche-domain Indic ASR (digit strings, currency amounts, addresses, brand names, and English/Indic code-mixing) is under-served by both the open-source state of the art and commercial systems. On a synthesised entity-dense Telugu test set, held out by synthesis system, vasista22/whisper-telugu-large-v2 (open SOTA) achieves an Entity-Hit-Rate (EHR) of 0.027 and Deepgram Nova-3 (commercial) 0.16.
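
To make the metric concrete, here is a minimal sketch of one way an entity hit rate could be computed. The abstract does not define EHR, so the matching rule (case- and whitespace-normalised substring match), the function name `entity_hit_rate`, and the toy data are all assumptions for illustration, not the authors' scoring code.

```python
def entity_hit_rate(references, hypotheses):
    """Fraction of reference entities found in the aligned ASR hypothesis.

    references: list of lists of entity strings, one list per utterance.
    hypotheses: list of ASR transcripts, aligned with `references`.
    Assumption: an entity counts as a hit if it appears as a substring
    after lowercasing and collapsing whitespace.
    """
    hits = 0
    total = 0
    for entities, hyp in zip(references, hypotheses):
        norm_hyp = " ".join(hyp.lower().split())
        for ent in entities:
            norm_ent = " ".join(ent.lower().split())
            total += 1
            if norm_ent in norm_hyp:
                hits += 1
    # An empty reference set is scored as 0.0 rather than raising.
    return hits / total if total else 0.0


# Hypothetical toy data: two utterances with three entities in total.
refs = [["₹500", "Hyderabad"], ["9848012345"]]
hyps = ["please send ₹500 to hyderabad", "call 98480 12345"]
ehr = entity_hit_rate(refs, hyps)  # 2 of 3 entities are recovered
```

Under this rule a digit string split by a stray space counts as a miss, which is one reason entity-level scores can sit far below what WER alone would suggest.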

We close this gap with a self-contained TTS<->STT flywheel: an open-source Indic TTS pipeline synthesises ~22,000 entity-dense Indic-English code-mix utterances at <$50 marginal cost, and a LoRA fine-tune on top of vasista22 achieves EHR 0.473 on the held-out test set (17x over open SOTA, 3x over commercial), with the read-prose regression bounded to +6.6 pp WER on FLEURS-Te.

Across languages, beta-Hi reaches EHR 0.337 (7x vs vasista22) and beta-Ta 0.543 (22x vs vasista22, 22x vs Deepgram); on Hindi, where Deepgram has substantial entity coverage, the flywheel underperforms the commercial system. All three beta models fall below their pre-registered EHR targets (0.75 for Te, 0.65 for Hi/Ta), and we report the shortfall as observed.

A native-human-recorded sanity check (n=20, Telugu) confirms transfer to real speech (beta-Te EHR 0.516 on native recordings vs 0.473 on synthetic audio). An EDSA (entity-dense synthetic audio) isolation ablation (LoRA on FLEURS-Te alone) yields EHR 0.020 on the same held-out set, attributing ~100% of the gain to the EDSA corpus.
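
The headline ratios and the ablation attribution can be checked from the Telugu numbers quoted in this abstract. The sketch below assumes the share of the gain due to EDSA is the total gain over the base model minus the gain obtained without EDSA, divided by the total gain; that formula and the variable names are illustrative assumptions, not taken from the paper.

```python
# Telugu EHR figures as reported in the abstract.
base_ehr = 0.027        # vasista22/whisper-telugu-large-v2
commercial_ehr = 0.16   # Deepgram Nova-3
flywheel_ehr = 0.473    # beta-Te, LoRA fine-tune that includes the EDSA corpus
ablation_ehr = 0.020    # LoRA on FLEURS-Te alone (no EDSA)

# Multiplicative gains, reported as 17x and 3x.
gain_over_open = flywheel_ehr / base_ehr
gain_over_commercial = flywheel_ehr / commercial_ehr

# Additive attribution (assumed formula, see lead-in).
total_gain = flywheel_ehr - base_ehr
gain_without_edsa = ablation_ehr - base_ehr   # slightly negative
edsa_share = (total_gain - gain_without_edsa) / total_gain

print(f"{gain_over_open:.1f}x over open SOTA")
print(f"{gain_over_commercial:.2f}x over commercial")
print(f"EDSA share of gain: {edsa_share:.1%}")
```

Because the FLEURS-only ablation lands marginally below the base model, the computed share comes out just above 100%, consistent with the "~100%" stated above.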

We additionally report a language-conditional finding: vanilla Whisper-large-v3 exhibits a Telugu-specific Script Collapse (SFR 0.46-0.71) that a per-language LoRA corrects (SFR 0.81-0.97), but the recipe is contraindicated on Hindi and Tamil, where vanilla SFR is already >= 0.98. Code, holdouts, predictions, the EDSA corpus, and entity dictionaries are released as open source.
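
SFR is not expanded or defined in this abstract. As a rough illustration of how script collapse could be detected, the sketch below assumes a score equal to the share of alphabetic characters that fall in the target script's Unicode block (Telugu is U+0C00 to U+0C7F). The function name `target_script_fraction` and the scoring rule are hypothetical and may differ from the authors' SFR.

```python
TELUGU_BLOCK = (0x0C00, 0x0C7F)  # Unicode range of the Telugu block


def target_script_fraction(text, block=TELUGU_BLOCK):
    """Share of alphabetic characters that belong to the target script.

    Combining vowel signs and the virama are not alphabetic per
    str.isalpha(), so only base letters are counted on both sides.
    Returns 0.0 when the text contains no alphabetic characters.
    """
    lo, hi = block
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    in_script = sum(1 for ch in letters if lo <= ord(ch) <= hi)
    return in_script / len(letters)


native = target_script_fraction("హైదరాబాద్")      # fully Telugu script
collapsed = target_script_fraction("hyderabad")   # romanised output
mixed = target_script_fraction("హలో hello")        # 2 Telugu + 5 Latin letters
```

A per-utterance score like this would need a policy for intentional English code-mix, since Latin-script brand names are legitimate output in this domain.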
