OpenBMB / VoxCPM

VoxCPM2: Tokenizer-Free TTS for Multilingual Speech Generation, Creative Voice Design, and True-to-Life Cloning

👋 Join our community for discussion and support! Feishu | Discord

VoxCPM is a tokenizer-free Text-to-Speech system that generates continuous speech representations directly, using an end-to-end diffusion autoregressive architecture. By bypassing discrete tokenization, it achieves highly natural and expressive synthesis. VoxCPM2 is the latest major release: a 2B-parameter model trained on over 2 million hours of multilingual speech data. It supports 30 languages, Voice Design, Controllable Voice Cloning, and 48kHz studio-quality audio output, and it is built on a MiniCPM-4 backbone.

✨ Highlights

🌍 30-Language Multilingual: Input text in any of the 30 supported languages and synthesize directly; no language tag is needed.
🎨 Voice Design: Create a brand-new voice from a natural-language description alone (gender, age, tone, emotion, pace, and so on); no reference audio is required.
🎛️ Controllable Cloning: Clone any voice from a short reference clip, with optional style guidance to steer emotion, pace, and expression while preserving the original timbre.
🎙️ Ultimate Cloning: Provide both the reference audio and its transcript, and the model continues seamlessly from the reference, faithfully preserving timbre, rhythm, emotion, and style (same as VoxCPM1.5).
🔊 48kHz High-Quality Audio: Accepts 16kHz reference audio and directly outputs 48kHz studio-quality audio through AudioVAE V2's asymmetric encode/decode design. Super-resolution is built in, so no external upsampler is needed.
🧠 Context-Aware Synthesis: Automatically infers appropriate prosody and expressiveness from the text content.
⚡ Real-Time Streaming: RTF as low as ~0.3 on an NVIDIA RTX 4090, and ~0.13 when accelerated by Nano-vLLM or vLLM-Omni (the official vLLM omni-modal serving for VoxCPM2, with PagedAttention and an OpenAI-compatible API).
📜 Fully Open-Source & Commercial-Ready: Weights and code are released under the Apache-2.0 license and are free for commercial use.
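
The real-time factor (RTF) quoted in the streaming highlight is the ratio of synthesis wall-clock time to the duration of the audio produced, so values below 1 mean generation runs faster than playback. The sketch below shows the arithmetic only; the helper name `real_time_factor` is hypothetical and the numbers are illustrative, not measurements.

```python
def real_time_factor(synthesis_seconds: float, num_samples: int, sample_rate: int) -> float:
    """RTF = time spent generating / duration of the generated audio."""
    if num_samples <= 0 or sample_rate <= 0:
        raise ValueError("num_samples and sample_rate must be positive")
    audio_seconds = num_samples / sample_rate
    return synthesis_seconds / audio_seconds


# Illustrative numbers only (not a benchmark):
# 3 s of compute for 10 s of 48 kHz audio (480,000 samples).
rtf = real_time_factor(3.0, 480_000, 48_000)
print(f"RTF = {rtf:.2f}")  # prints "RTF = 0.30"
```

In practice, `synthesis_seconds` could be measured with `time.perf_counter()` around a generation call, and `num_samples` taken from the length of the returned waveform.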

🌍 Supported Languages (30)

Arabic, Burmese, Chinese, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hindi, Indonesian, Italian, Japanese, Khmer, Korean, Lao, Malay, Norwegian, Polish, Portuguese, Russian, Spanish, Swahili, Swedish, Tagalog, Thai, Turkish, Vietnamese.

Chinese Dialects: Sichuanese, Cantonese, Wu, Northeastern Mandarin, Henan dialect, Shaanxi dialect, Shandong dialect, Tianjin dialect, Hokkien (Southern Min).

🚀 Quick Start

Installation

pip install voxcpm

Requirements: Python ≥ 3.10 (<3.13), PyTorch ≥ 2.5.0, CUDA ≥ 12.0.
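
A quick way to confirm the interpreter falls inside the stated Python range before installing is sketched below. The helper name `python_supported` is hypothetical, and the check covers only the Python version, not PyTorch or CUDA.

```python
import sys


def python_supported(version_info=sys.version_info) -> bool:
    """True when the interpreter is >= 3.10 and < 3.13, per the stated requirements."""
    return (3, 10) <= tuple(version_info[:2]) < (3, 13)


if not python_supported():
    print(
        f"Python {sys.version_info.major}.{sys.version_info.minor} "
        "is outside the supported range (>= 3.10, < 3.13)"
    )
```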

Python API

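from voxcpm import VoxCPM
import soundfile as sf

# Load the pretrained VoxCPM2 weights (the optional denoiser is left off).
model = VoxCPM.from_pretrained("openbmb/VoxCPM2", load_denoiser=False)

# Synthesize speech for the given text.
wav = model.generate(
    text="VoxCPM2 is the current recommended release for realistic multilingual speech synthesis.",
    cfg_value=2.0,
    inference_timesteps=10,
)

# Save the waveform at the model's output sample rate.
sf.write("demo.wav", wav, model.tts_model.sample_rate)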

🎨 Voice Design

Create a voice from a natural-language description placed in parentheses at the start of the text; no reference audio is needed.

wav = model.generate(
    text="(A young woman, gentle and sweet voice)Hello, welcome to VoxCPM2!",
    cfg_value=2.0,
    inference_timesteps=10,
)
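
The example above places the voice description in parentheses directly before the text to be spoken. A small helper can build such strings consistently. This is a minimal sketch: the name `with_voice_description` is hypothetical, and rejecting parentheses inside the description is an assumption made here to keep the prefix unambiguous, not a documented rule.

```python
def with_voice_description(description: str, text: str) -> str:
    """Prefix `text` with a parenthesized description, mirroring the example above."""
    description = description.strip()
    if not description:
        return text  # no description: leave the text unchanged
    # Assumption: nested parentheses could make the prefix ambiguous, so reject them.
    if "(" in description or ")" in description:
        raise ValueError("description must not contain parentheses")
    return f"({description}){text}"


prompt = with_voice_description(
    "A young woman, gentle and sweet voice",
    "Hello, welcome to VoxCPM2!",
)
print(prompt)  # (A young woman, gentle and sweet voice)Hello, welcome to VoxCPM2!
```

The resulting string would then be passed as the `text` argument of `model.generate`.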

🎛️ Controllable Voice Cloning

Provide a reference audio clip. The model clones the timbre, and you can still use control instructions to adjust speed, emotion, or style.

wav = model.generate(
    text="(slightly faster, cheerful tone)This is a cloned voice with style control.",
    reference_wav_path="path/to/voice.wav",
    cfg_value=2.0,
    inference_timesteps=10,
)
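
Before passing a clip as `reference_wav_path`, it can help to inspect its basic properties, since the highlights mention 16kHz reference audio. The sketch below uses only the standard-library `wave` module, which reads PCM WAV files; the helper name `describe_wav` is hypothetical, and nothing here is part of the VoxCPM API.

```python
import wave


def describe_wav(path: str) -> dict:
    """Return basic properties of a PCM WAV file using only the standard library."""
    with wave.open(path, "rb") as wf:
        frames = wf.getnframes()
        rate = wf.getframerate()
        return {
            "sample_rate": rate,
            "channels": wf.getnchannels(),
            "duration_seconds": frames / rate,
        }


# Usage (path is a placeholder):
# info = describe_wav("path/to/voice.wav")
# print(info["sample_rate"], info["duration_seconds"])
```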

🎙️ Ultimate Cloning

Provide both the reference audio and its exact transcript for audio-continuation-based cloning.

wav = model.generate(
    text="This is an ultimate cloning demonstration using VoxCPM2.",
    prompt_wav_path="path/to/voice.wav",
    prompt_text="The transcript of the reference audio.",
    reference_wav_path="path/to/voice.wav",
)
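
The three modes above differ only in which keyword arguments accompany `text`. The sketch below assembles those arguments and catches one easy mistake. The helper name `build_generate_kwargs` is hypothetical, and the rule that `prompt_wav_path` and `prompt_text` must appear together is an assumption drawn from the description of Ultimate Cloning, not a documented constraint of the library.

```python
from typing import Optional


def build_generate_kwargs(
    text: str,
    reference_wav_path: Optional[str] = None,
    prompt_wav_path: Optional[str] = None,
    prompt_text: Optional[str] = None,
) -> dict:
    """Assemble keyword arguments for the usage modes shown in this README."""
    # Assumption: a prompt clip and its transcript are only meaningful as a pair.
    if (prompt_wav_path is None) != (prompt_text is None):
        raise ValueError("prompt_wav_path and prompt_text must be given together")
    kwargs = {"text": text}
    optional = (
        ("reference_wav_path", reference_wav_path),
        ("prompt_wav_path", prompt_wav_path),
        ("prompt_text", prompt_text),
    )
    for name, value in optional:
        if value is not None:
            kwargs[name] = value
    return kwargs


# Controllable cloning: reference clip only (path is a placeholder).
kwargs = build_generate_kwargs(
    "(cheerful tone)Hello there.",
    reference_wav_path="path/to/voice.wav",
)
print(sorted(kwargs))  # ['reference_wav_path', 'text']
```

With a loaded `model`, the result could be passed as `model.generate(**kwargs)`.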