Unlocking Fine-Grained and Within-Utterance Speaking Style Control in Prompt-Based Text-to-Speech Models

Abstract: While prompt-based text-to-speech (TTS) models enable natural language-driven speaking style control, they often provide limited fine-grained control and apply a single global style across an utterance. This restricts practical use cases that require continuous style attribute interpolation across utterances and time-varying style transitions within a single utterance.

In this paper, we propose novel techniques to achieve both capabilities in existing prompt-based TTS models. For inter-utterance style interpolation, we compute direction vectors between contrastive style prompts in the embedding space and perform simple interpolation, enabling smooth transitions between style characteristics.
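
The interpolation step described above can be illustrated with a minimal sketch. It assumes each style prompt has already been mapped to a fixed-size embedding vector; the toy vectors, the function names `style_direction` and `interpolate_style`, and the scalar `alpha` are hypothetical and not taken from the paper, which may compute or apply the direction differently.

```python
import numpy as np

def style_direction(emb_from: np.ndarray, emb_to: np.ndarray) -> np.ndarray:
    """Direction vector pointing from one contrastive style prompt to the other."""
    return emb_to - emb_from

def interpolate_style(emb_base: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Move a base embedding along the style direction by a factor alpha."""
    return emb_base + alpha * direction

# Toy stand-ins for the embeddings of two contrastive prompts
# (for example "speaks slowly" vs. "speaks quickly"). Real embeddings
# would come from the TTS model's prompt encoder (assumption).
emb_slow = np.array([0.0, 1.0, 2.0, 3.0])
emb_fast = np.array([4.0, 1.0, 0.0, 3.0])

direction = style_direction(emb_slow, emb_fast)

# Sweep alpha to obtain a graded sequence of style embeddings.
sweep = [interpolate_style(emb_slow, direction, a) for a in (0.0, 0.5, 1.0)]
```

In this sketch, `alpha = 0` reproduces the first prompt's embedding, `alpha = 1` reproduces the second, and intermediate values give intermediate embeddings that could then be passed to the synthesizer in place of a prompt embedding.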

For intra-utterance style transition, we first identify a strong attention bias toward early tokens in autoregressive TTS decoders, which causes the initial audio realization to dominate subsequent generation. To mitigate this effect, we introduce KV-cache swapping and sliding-window attention masking.
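
As a rough illustration of the two mechanisms named above, the sketch below builds a generic causal sliding-window mask and swaps the prompt portion of a toy cache. It is one possible reading, not the paper's implementation: the abstract does not specify the window size, whether prompt tokens stay visible outside the window, or how the cache is laid out. A single 2-D array stands in for what would be per-layer, per-head key and value tensors in a real decoder, and `swap_prompt_kv` is a hypothetical helper that assumes the prompt entries occupy the leading rows.

```python
import numpy as np

def sliding_window_causal_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask where True means query position i may attend to key position j.

    Each position sees itself and at most the (window - 1) positions before it,
    so tokens generated early fall out of view as decoding proceeds.
    """
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

def swap_prompt_kv(cache: np.ndarray, new_prompt_kv: np.ndarray) -> np.ndarray:
    """Return a copy of the cache whose leading rows are replaced.

    Hypothetical helper: assumes a (seq_len, dim) cache with the style-prompt
    entries first and the new prompt occupying the same number of rows.
    """
    prompt_len = new_prompt_kv.shape[0]
    swapped = cache.copy()
    swapped[:prompt_len] = new_prompt_kv
    return swapped

mask = sliding_window_causal_mask(seq_len=6, window=3)

# Toy cache: 2 prompt rows followed by 3 generated-audio rows.
cache = np.zeros((5, 2))
new_prompt_kv = np.ones((2, 2))
swapped = swap_prompt_kv(cache, new_prompt_kv)
```

With `window=3`, the last query position attends only to positions 3, 4, and 5, so the earliest tokens no longer contribute to its attention output. The swap leaves the generated-audio rows untouched and replaces only the rows attributed to the style prompt.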

Experiments demonstrate that our proposed inter-utterance interpolation achieves a 99-100% success rate in gender conversion, up to 36 Hz pitch variation, and up to 1.6 syllables-per-second speed change. Our intra-utterance transition maintains a speaker similarity of 0.81-0.91 and achieves perceptual smoothness scores of 3.48-4.48.
