Bridging the Stability-Expressivity Gap: Synthetic Data Scaling and Preference Alignment for Low-Resource Spoken Language Models

Abstract: Spoken Language Models (SLMs) have emerged as a promising paradigm for speech synthesis by bypassing explicit grapheme-to-phoneme pipelines. However, their effectiveness in low-resource languages remains fundamentally limited by the scarcity of transcribed speech. In practice, synthetic data has become the primary strategy for scaling SLMs in such settings, providing reliable phonetic supervision when real data is insufficient.

In this work, we show that this reliance introduces a fundamental trade-off, which we term the Stability-Expressivity Gap: synthetic data improves phonetic accuracy but progressively suppresses prosodic variability, ultimately causing expressivity to collapse, a phenomenon we call Synthetic Erosion.

To bridge this gap, we propose two self-alignment frameworks. Disentanglement-Guided Self-Alignment (DGSA) recovers expressivity for complex languages by exploiting prosody-timbre separation. For regimes where authentic references are exceptionally limited, Temperature-Driven Self-Critique (TDSC) stabilizes generation through automated exploration and filtering.

Our approach outperforms strong commercial systems, including ElevenLabs and Gemini Pro, and enables the first zero-shot voice cloning capability for Lao.
