Your Multimodal Speech Model Says I Have a Face for Radio

Abstract: As large neural models have become better at language tasks, researchers are increasingly building multi- and omnimodal models that handle more modalities of data. One example is the expansion of speech recognition models to audio-visual data for noise mitigation and multimodal subtitling.

While performance and bias have been studied extensively in the single-modality regime, it is unknown how adding modalities affects either, even though the same modalities are known to induce biases in humans. We therefore propose the first bias evaluation of multimodal speech recognition: we create videos that pair different faces with the same audio and measure the resulting changes in speech transcription accuracy.
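The measurement described above can be illustrated with a minimal sketch. It assumes that each video yields a (face group, reference transcript, model transcript) record; the group labels and sample sentences are hypothetical, and the plain word-level edit distance stands in for whatever scoring tool the evaluation actually uses. Because the audio is identical across face conditions, any difference in word error rate (WER) between groups is attributable to the visual stream.

```python
from collections import defaultdict


def word_edit_distance(ref, hyp):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def corpus_wer(pairs):
    """WER in percent over (reference, hypothesis) string pairs."""
    errors = sum(word_edit_distance(r.split(), h.split()) for r, h in pairs)
    words = sum(len(r.split()) for r, _ in pairs)
    return 100.0 * errors / words


def wer_by_group(records):
    """Group (face_group, reference, hypothesis) records and score each group."""
    grouped = defaultdict(list)
    for group, ref, hyp in records:
        grouped[group].append((ref, hyp))
    return {group: corpus_wer(pairs) for group, pairs in grouped.items()}


# Hypothetical records: same audio (same reference), different paired face.
records = [
    ("face_group_a", "turn the radio on", "turn the radio on"),
    ("face_group_b", "turn the radio on", "turn the radio off"),
]
scores = wer_by_group(records)
gap = max(scores.values()) - min(scores.values())  # WER points between groups
```

Here `gap` is the largest between-group difference in WER points, the same unit in which the quality-of-service differences are reported.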

We find large quality-of-service differences across mWhisper-Flamingo and Gemini models, with transcription quality degrading by up to 4.05 word error rate (WER) points across self-declared gender, ethnicity, and their intersection. Our findings make it a priority for developers to evaluate, fix, and communicate such limitations: providing more signals through additional modalities is not necessarily better, and may even lead to biased outcomes.
