Elderly-Contextual Data Augmentation via Speech Synthesis for Elderly ASR

Abstract: Despite recent progress in automatic speech recognition (ASR), elderly ASR (EASR) remains challenging due to limited training data and the distinct acoustic and linguistic characteristics of elderly speech.

In this work, we address data scarcity in EASR through a data augmentation pipeline that combines large language model (LLM)-based transcript paraphrasing with text-to-speech (TTS) synthesis.

Given an elderly speech dataset, the LLM first generates elderly-contextual paraphrases of the original transcripts, and the TTS model then synthesizes the corresponding speech using elderly reference speakers.

The resulting synthetic audio-text pairs are merged with the original data to fine-tune Whisper without architectural modification.

We further analyze the effects of augmentation ratio and reference-speaker composition in low-resource EASR.
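
The augmentation pipeline described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `Utterance`, `augment`, `toy_paraphrase`, and `toy_synthesize` are hypothetical names, the paraphrasing and synthesis steps are stand-in callables where a real LLM and TTS model would be plugged in, `ratio` is one assumed way to express the augmentation ratio (synthetic pairs per original pair), and picking a reference speaker at random is likewise an assumption.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Utterance:
    audio: str               # path or identifier of the audio (placeholder type)
    text: str                # transcript
    synthetic: bool = False  # True for TTS-generated pairs


def augment(
    originals: Sequence[Utterance],
    paraphrase: Callable[[str], str],       # hypothetical LLM wrapper: text -> paraphrase
    synthesize: Callable[[str, str], str],  # hypothetical TTS wrapper: (text, speaker) -> audio
    reference_speakers: Sequence[str],
    ratio: float = 1.0,                     # assumed definition: synthetic pairs per original
    seed: int = 0,
) -> List[Utterance]:
    """Return the originals followed by round(ratio * len(originals)) synthetic pairs."""
    if not originals:
        raise ValueError("originals must not be empty")
    if not reference_speakers:
        raise ValueError("at least one reference speaker is required")
    if ratio < 0:
        raise ValueError("ratio must be non-negative")

    rng = random.Random(seed)
    n_synthetic = round(ratio * len(originals))
    synthetic: List[Utterance] = []
    for i in range(n_synthetic):
        source = originals[i % len(originals)]    # cycle through the original transcripts
        speaker = rng.choice(reference_speakers)  # assumption: uniform speaker sampling
        new_text = paraphrase(source.text)
        new_audio = synthesize(new_text, speaker)
        synthetic.append(Utterance(new_audio, new_text, synthetic=True))
    return list(originals) + synthetic


# Toy stand-ins so the sketch runs without any model; replace with real LLM/TTS calls.
def toy_paraphrase(text: str) -> str:
    return f"could you please {text}"


def toy_synthesize(text: str, speaker: str) -> str:
    return f"synthetic::{speaker}::{text}"


if __name__ == "__main__":
    data = [
        Utterance("a.wav", "turn on the light"),
        Utterance("b.wav", "call my daughter"),
    ]
    for item in augment(data, toy_paraphrase, toy_synthesize, ["spk_a", "spk_b"], ratio=1.5):
        print(item)
```

In this sketch the merged list would then feed an ordinary Whisper fine-tuning loop, and varying `ratio` or the contents of `reference_speakers` corresponds to the two factors analyzed above.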

Experiments on English and Korean elderly speech datasets from speakers aged 70 and above show that the proposed method consistently improves performance over conventional augmentation baselines, achieving up to a 58.2% reduction in word error rate (WER) compared with the Whisper baseline.
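
For readers unfamiliar with the metric, the sketch below shows how WER and a reduction relative to a baseline are commonly computed. It assumes whitespace tokenization and a relative (not absolute) reduction; the abstract does not state which definition is used, and the function names and example sentences are illustrative only, not taken from the experiments.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed so far
    # (initially none) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn the first i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer is lower than baseline_wer (assumed relative definition)."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    reference = "turn on the kitchen light"
    baseline = word_error_rate(reference, "turn in the chicken light")  # two substitutions
    improved = word_error_rate(reference, "turn on kitchen light")      # one deletion
    print(baseline, improved, relative_reduction(baseline, improved))
```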