SPARCLE: SPeaker-aware Aligned Representations via Contrastive Language Embeddings

Abstract: Recent work in speech synthesis has shifted from phoneme representations to direct grapheme modeling. While phonemes address the one-to-many mapping between text and acoustics, they rely on grapheme-to-phoneme (G2P) systems, which fail to capture speaker-specific acoustic variation.

Prior work demonstrates that grapheme-based models outperform phoneme-based systems at scale, but not in low-resource settings. In this paper, we propose SPARCLE, a speaker-aware grapheme representation model that enriches characters with their precise acoustic realizations.

SPARCLE is trained with a contrastive objective that aligns graphemes with their corresponding Wav2Vec2 acoustic representations while conditioning on speaker identity. The resulting model serves as a replacement for G2P systems in downstream text-to-speech (TTS) tasks.
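
To make the contrastive alignment concrete, here is a minimal sketch of a symmetric InfoNCE-style loss between paired grapheme and acoustic embeddings. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the loss form, the temperature, or how speaker identity is injected. The function names `info_nce_loss` and `condition_on_speaker`, the additive speaker conditioning, and the default temperature of 0.07 are all hypothetical choices.

```python
import numpy as np


def condition_on_speaker(grapheme_emb, speaker_emb):
    """Hypothetical conditioning: add a speaker vector to each grapheme embedding.

    The abstract only says the model is conditioned on speaker identity;
    simple addition is an assumption made for this sketch.
    """
    return grapheme_emb + speaker_emb


def info_nce_loss(grapheme_emb, acoustic_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    Row i of grapheme_emb is assumed to be paired with row i of acoustic_emb
    (for example, a pooled Wav2Vec2 feature for that grapheme); every other
    row in the batch acts as a negative.
    """
    # L2-normalize so that dot products are cosine similarities.
    g = grapheme_emb / np.linalg.norm(grapheme_emb, axis=1, keepdims=True)
    a = acoustic_emb / np.linalg.norm(acoustic_emb, axis=1, keepdims=True)
    logits = g @ a.T / temperature  # shape: (batch, batch)

    def cross_entropy_diagonal(z):
        # Numerically stable log-softmax; the target for row i is column i.
        z = z - z.max(axis=1, keepdims=True)
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the grapheme-to-acoustic and acoustic-to-grapheme directions.
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits.T))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    acoustic = rng.normal(size=(8, 16))       # stand-in for Wav2Vec2 features
    speaker = 0.01 * rng.normal(size=(1, 16))  # stand-in speaker vector
    aligned = condition_on_speaker(acoustic.copy(), speaker)
    unrelated = rng.normal(size=(8, 16))
    print("aligned loss:  ", info_nce_loss(aligned, acoustic))
    print("unrelated loss:", info_nce_loss(unrelated, acoustic))
```

The loss is small when each grapheme embedding is closest to its own acoustic target and large when the pairing is uninformative, which is the behavior a contrastive alignment objective is meant to encourage.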

We demonstrate that SPARCLE improves generation quality, halving word error rate relative to standard grapheme-based models in extreme low-resource settings.
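
For readers unfamiliar with the metric, the sketch below shows how word error rate (WER) and a relative reduction can be computed. It assumes WER is measured in the common way for TTS, by transcribing the synthesized audio and comparing the transcript against the input text; the abstract does not describe the evaluation pipeline. The helper names and the example sentences are hypothetical and are not results from the paper.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the words of ref seen so far
    # and the first j words of hyp.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer, improved_wer):
    """Fraction by which WER dropped relative to the baseline."""
    return (baseline_wer - improved_wer) / baseline_wer


if __name__ == "__main__":
    # Illustrative strings only: one substitution and one deletion
    # against a six-word reference.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Under this definition, "halving" WER corresponds to `relative_reduction` returning 0.5.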

Paper Details:

Authors: Priyam Mazumdar, Yurii Halychanskyi, Steven Guo, Mark Hasegawa-Johnson, Volodymyr Kindratenko
arXiv ID: 2607.01238
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
