SPARCLE: SPeaker-aware Aligned Representations via Contrastive Language Embeddings

Abstract: Recent advances in speech synthesis have shifted from phoneme representations to direct grapheme modeling. While phonemes address the one-to-many mapping between text and acoustics, they rely on grapheme-to-phoneme (G2P) systems that fail to capture speaker-specific acoustic variation.

Prior work demonstrates that grapheme-based models outperform phoneme-based systems at scale, but not in low-resource settings. In this paper, we propose SPARCLE, a speaker-aware grapheme representation model that enriches characters with their precise acoustic realizations.

SPARCLE is trained with a contrastive objective to align graphemes with their corresponding Wav2Vec2 acoustic representations while conditioning on speaker identity. The resulting model serves as a replacement for G2P systems in downstream text-to-speech (TTS) tasks.
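The abstract does not spell out the exact loss, so the following is only a minimal sketch of one common contrastive formulation (a symmetric InfoNCE loss), not the authors' implementation. The function names, the additive speaker conditioning, the temperature of 0.07, and the use of random vectors in place of real Wav2Vec2 features are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def speaker_conditioned_info_nce(grapheme_emb, acoustic_emb, speaker_emb,
                                 temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched triples.

    All inputs have shape (N, D); row i of each array belongs to the same
    (grapheme, acoustic frame, speaker) example. Hypothetical helper, not
    taken from the paper.
    """
    # Assumption: speaker identity is injected by simple addition.
    text_side = l2_normalize(grapheme_emb + speaker_emb)
    audio_side = l2_normalize(acoustic_emb)

    # (N, N) similarity matrix; the diagonal holds the positive pairs.
    logits = text_side @ audio_side.T / temperature
    diag = np.arange(logits.shape[0])

    # Grapheme-to-acoustic and acoustic-to-grapheme directions.
    loss_t2a = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_a2t = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_t2a + loss_a2t)


# Toy usage with random stand-ins for Wav2Vec2 features.
rng = np.random.default_rng(0)
acoustic = rng.normal(size=(8, 16))
speaker = rng.normal(size=(8, 16))
grapheme = rng.normal(size=(8, 16))
print(speaker_conditioned_info_nce(grapheme, acoustic, speaker))
```

Minimizing a loss of this shape pulls each speaker-conditioned grapheme vector toward its own acoustic vector and pushes it away from the other acoustic vectors in the batch, which is the general sense in which a contrastive objective "aligns" the two spaces.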

We demonstrate that SPARCLE improves generation quality, reducing word error rates by half in extreme low-resource settings compared to standard grapheme-based models.
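For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. The sketch below assumes plain whitespace tokenization; the function name `word_error_rate` is hypothetical, and the abstract does not describe the paper's own evaluation pipeline, such as the recognizer or text normalization used.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Under this definition, halving the error rate means the transcribed synthetic speech needs half as many word-level edits per reference word to match the input text.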


Paper Details:

  • Authors: Priyam Mazumdar, Yurii Halychanskyi, Steven Guo, Mark Hasegawa-Johnson, Volodymyr Kindratenko
  • arXiv ID: 2607.01238
  • Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
