SpeechDx: A Multi-Task Benchmark for Clinical Speech AI

Speech offers a uniquely informative window into health by simultaneously engaging neurological, motor, respiratory, and vocal systems. Current clinical speech AI methods have largely progressed through isolated condition-specific studies, making results difficult to compare and generalization difficult to assess.

We introduce SpeechDx, a large-scale benchmark for clinical speech AI spanning 12 datasets and 27 tasks across diverse health conditions. To enable evaluation across shared clinical mechanisms, SpeechDx structures tasks by the stage of speech production they disrupt: conceptualization, formulation, and articulation.

The benchmark tests generalization in two ways: it includes tasks with limited labeled data, and it evaluates the same health condition across multiple datasets to separate clinically meaningful patterns from dataset artefacts. We systematically evaluate 12 state-of-the-art audio encoders across all tasks and under zero-shot cross-condition transfer.
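
To illustrate why evaluating one condition across several datasets can expose dataset artefacts, here is a minimal sketch on synthetic data. It is not the SpeechDx protocol, which the abstract does not specify. The nearest-centroid probe, the function names, and the "embeddings" (random vectors standing in for frozen-encoder outputs) are all assumptions made for illustration. Coordinate 0 carries a signal shared by both datasets; coordinate 1 carries a cue that tracks the label only in dataset A.

```python
import random

def make_dataset(n, dim, artefact, seed):
    """Synthetic stand-in for frozen-encoder embeddings (hypothetical data).

    Coordinate 0 holds a class signal present in every dataset.
    Coordinate 1 holds a dataset-specific cue that tracks the label
    with strength `artefact` (0.0 means the cue is absent).
    """
    rng = random.Random(seed)
    data = []
    for i in range(n):
        label = i % 2
        vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        vec[0] += 3.0 * label        # shared, condition-related signal
        vec[1] += artefact * label   # artefact tied to one dataset
        data.append((vec, label))
    return data

def fit_centroids(data):
    """Nearest-centroid probe: one mean vector per class."""
    centroids = {}
    for label in {y for _, y in data}:
        rows = [x for x, y in data if y == label]
        centroids[label] = [sum(col) / len(col) for col in zip(*rows)]
    return centroids

def predict(centroids, vec):
    """Return the label whose centroid is closest in squared distance."""
    def sqdist(center):
        return sum((a - b) ** 2 for a, b in zip(vec, center))
    return min(centroids, key=lambda label: sqdist(centroids[label]))

def accuracy(centroids, data):
    correct = sum(predict(centroids, x) == y for x, y in data)
    return correct / len(data)

# Train on dataset A, then test on held-out A and on dataset B,
# which shares the condition signal but lacks the artefact.
train_a = make_dataset(200, 8, artefact=4.0, seed=0)
test_a = make_dataset(200, 8, artefact=4.0, seed=1)
test_b = make_dataset(200, 8, artefact=0.0, seed=2)

centroids = fit_centroids(train_a)
within_acc = accuracy(centroids, test_a)
cross_acc = accuracy(centroids, test_b)
print(f"within-dataset accuracy: {within_acc:.2f}")
print(f"cross-dataset accuracy:  {cross_acc:.2f}")
```

In this toy setting the probe leans on the artefact, so its accuracy drops on dataset B even though the shared signal is still present. A within-dataset score alone would not reveal that.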

Results show that large-scale speech models are the strongest overall baselines, that domain-specific models improve performance only on closely matched tasks, and that no current representation generalizes reliably across the clinical speech landscape. SpeechDx establishes a shared evaluation framework for tracking progress toward general-purpose clinical speech representations.
