VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing

Abstract: Human speech conveys more than linguistic content: it also carries personality, mood, and performance elements, such as a comforting tone or a hummed song. We formalize these forms of expressiveness as role-playing and singing.

We present VITA-QinYu, the first expressive end-to-end (E2E) spoken language model (SLM) that goes beyond natural conversation to support both role-playing and singing generation.

VITA-QinYu adopts a hybrid speech-text paradigm that extends interleaved text-audio modeling with multi-codebook audio tokens. This design enables richer paralinguistic representation while keeping the two modalities clearly separated, which avoids interference between them.
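To make the idea of interleaved text-audio modeling with multi-codebook audio tokens concrete, the following is a minimal sketch and not the authors' implementation. It assumes a layout in which short runs of text tokens alternate with short runs of audio frames, where each audio frame carries one index per codebook. The names `Step` and `interleave`, the chunk sizes, and the number of codebooks are all hypothetical; the abstract does not specify them.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Step:
    """One position in the interleaved sequence (hypothetical structure).

    A step is either a text token or an audio frame, never both, which
    mirrors the clear separation between modalities described above.
    """
    modality: str                                  # "text" or "audio"
    text_id: Optional[int] = None                  # set only for text steps
    audio_codes: Optional[Tuple[int, ...]] = None  # one index per codebook


def interleave(
    text_ids: Sequence[int],
    audio_frames: Sequence[Tuple[int, ...]],
    text_chunk: int = 2,   # assumed chunk size, not taken from the paper
    audio_chunk: int = 3,  # assumed chunk size, not taken from the paper
) -> List[Step]:
    """Alternate chunks of text tokens with chunks of multi-codebook frames."""
    if text_chunk < 1 or audio_chunk < 1:
        raise ValueError("chunk sizes must be positive")
    if audio_frames:
        num_codebooks = len(audio_frames[0])
        if any(len(frame) != num_codebooks for frame in audio_frames):
            raise ValueError("every audio frame needs one code per codebook")

    steps: List[Step] = []
    t = a = 0
    # Keep emitting chunks until both streams are exhausted.
    while t < len(text_ids) or a < len(audio_frames):
        for tid in text_ids[t:t + text_chunk]:
            steps.append(Step("text", text_id=tid))
        t += text_chunk
        for frame in audio_frames[a:a + audio_chunk]:
            steps.append(Step("audio", audio_codes=tuple(frame)))
        a += audio_chunk
    return steps


if __name__ == "__main__":
    # Three text tokens and four audio frames with two codebooks each.
    sequence = interleave([10, 11, 12], [(1, 2), (3, 4), (5, 6), (7, 8)])
    print([step.modality for step in sequence])
```

In this sketch every audio step holds a tuple of codebook indices rather than a single token, which is one plausible way to carry richer acoustic detail per frame while text steps remain ordinary token IDs.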

We further develop a comprehensive data generation pipeline that synthesizes a total of 15.8K hours of natural conversation, role-playing, and singing data for training.

VITA-QinYu demonstrates superior expressiveness: it outperforms peer SLMs by 7 percentage points on objective role-playing benchmarks and surpasses peer models by 0.13 points on a 5-point MOS scale for singing.

At the same time, it achieves state-of-the-art conversational accuracy and fluency, exceeding prior SLMs by 1.38 and 4.98 percentage points on the C3 and URO benchmarks, respectively.

We open-source our code and models and provide an easy-to-use demo with full-stack support for streaming and full-duplex interaction.