VibeVoice: Open-source frontier voice AI

News:

2026-03-06: 🚀 VibeVoice ASR is now part of a Transformers release! You can use our speech recognition model directly through the Hugging Face Transformers library and integrate it into your projects.

2026-01-21: 📣 We open-sourced VibeVoice-ASR, a unified speech-to-text model designed to handle 60-minute long-form audio in a single pass, generating structured transcriptions containing Who (Speaker), When (Timestamps), and What (Content), with support for User-Customized Context. Try it in Playground.
⭐️ VibeVoice-ASR is natively multilingual, supporting over 50 languages; check the supported languages for details.
🔥 The VibeVoice-ASR finetuning code is now available!
⚡️ vLLM is now supported for faster inference; see vllm-asr for more details.
📑 The VibeVoice-ASR technical report is available.

2025-12-16: 📣 We added experimental speakers to VibeVoice‑Realtime‑0.5B for exploration, including multilingual voices in nine languages (DE, FR, IT, JP, KR, NL, PL, PT, ES) and 11 distinct English style voices. Try it. More speaker types will be added over time.

2025-12-03: 📣 We open-sourced VibeVoice‑Realtime‑0.5B, a real‑time text‑to‑speech model that supports streaming text input and robust long-form speech generation. Try it on Colab.

2025-09-05: VibeVoice is an open-source research framework intended to advance collaboration in the speech synthesis community. After release, we discovered instances where the tool was used in ways inconsistent with the stated intent. Since responsible use of AI is one of Microsoft’s guiding principles, we have removed the VibeVoice-TTS code from this repository.

2025-08-25: 📣 We open-sourced VibeVoice-TTS, a long-form multi-speaker text-to-speech model that can synthesize speech up to 90 minutes long with up to 4 distinct speakers. The work was accepted as an Oral at ICLR 2026!

Overview

VibeVoice is a family of open-source frontier voice AI models that includes both Text-to-Speech (TTS) and Automatic Speech Recognition (ASR) models. A core innovation of VibeVoice is its use of continuous speech tokenizers (Acoustic and Semantic) operating at an ultra-low frame rate of 7.5 Hz. These tokenizers efficiently preserve audio fidelity while significantly boosting computational efficiency for processing long sequences. VibeVoice employs a next-token diffusion framework, leveraging a Large Language Model (LLM) to understand textual context and dialogue flow, and a diffusion head to generate high-fidelity acoustic details.
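To make the efficiency claim concrete, the short calculation below estimates how many tokenizer frames long recordings produce at 7.5 Hz. It is a back-of-the-envelope sketch, not a description of the actual implementation: it assumes one sequence position per frame, reads the "64K" context mentioned for VibeVoice-ASR as 65,536 tokens, and uses a hypothetical 50 Hz tokenizer purely for comparison. The helper name `frames_for` is illustrative.

```python
# Back-of-the-envelope sequence lengths at a 7.5 Hz frame rate.
FRAME_RATE_HZ = 7.5          # frame rate stated in the overview
CONTEXT_TOKENS = 64 * 1024   # "64K" read as 65,536 tokens (an assumption)


def frames_for(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of tokenizer frames needed to cover `minutes` of audio."""
    return round(minutes * 60 * frame_rate_hz)


asr_frames = frames_for(60)   # one hour of audio for ASR
tts_frames = frames_for(90)   # 90 minutes of generated speech for TTS

# Hypothetical comparison point: the same hour at 50 frames per second.
hypothetical_50hz_frames = frames_for(60, frame_rate_hz=50.0)

print(asr_frames, tts_frames, hypothetical_50hz_frames)
# Under the one-position-per-frame assumption, does an hour fit in the context?
print(asr_frames <= CONTEXT_TOKENS)
```

Under these assumptions an hour of audio occupies 27,000 frames, which leaves room in a 64K context for text tokens such as the transcription itself, whereas the hypothetical 50 Hz tokenizer would need 180,000 frames for the same hour.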

1. 📖 VibeVoice-ASR - Long-form Speech Recognition

VibeVoice-ASR is a unified speech-to-text model designed to handle 60-minute long-form audio in a single pass, generating structured transcriptions containing Who (Speaker), When (Timestamps), and What (Content), with support for Customized Hotwords.

🕒 60-minute Single-Pass Processing: Unlike conventional ASR models that slice audio into short chunks (often losing global context), VibeVoice-ASR accepts up to 60 minutes of continuous audio input within a 64K-token context length. This ensures consistent speaker tracking and semantic coherence across the entire hour.
👤 Customized Hotwords: Users can provide customized hotwords (e.g., specific names, technical terms, or background info) to guide the recognition process, significantly improving accuracy on domain-specific content.
📝 Rich Transcription (Who, When, What): The model jointly performs ASR, diarization, and timestamping, producing a structured output that indicates who said what and when.
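One way to picture the Who / When / What output is as a list of segments, each carrying a speaker label, a time span, and the recognized text. The sketch below is illustrative only: the `Segment` class, its field names, and the helper functions are hypothetical and do not reflect the model's actual output schema, which this section does not specify.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical container for one transcribed utterance."""
    speaker: str   # Who
    start: float   # When: start time in seconds
    end: float     # When: end time in seconds
    text: str      # What


def format_timestamp(seconds: float) -> str:
    """Render seconds as HH:MM:SS, truncating fractional seconds."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def render(segments: list[Segment]) -> str:
    """Produce a readable transcript ordered by start time."""
    ordered = sorted(segments, key=lambda seg: seg.start)
    return "\n".join(
        f"[{format_timestamp(seg.start)} - {format_timestamp(seg.end)}] "
        f"{seg.speaker}: {seg.text}"
        for seg in ordered
    )


def speaking_time(segments: list[Segment]) -> dict[str, float]:
    """Total seconds spoken per speaker."""
    totals: dict[str, float] = defaultdict(float)
    for seg in segments:
        totals[seg.speaker] += seg.end - seg.start
    return dict(totals)


# Made-up example data, not real model output.
segments = [
    Segment("Speaker 1", 0.0, 4.5, "Welcome to the show."),
    Segment("Speaker 2", 4.5, 9.0, "Thanks for having me."),
    Segment("Speaker 1", 3599.0, 3600.0, "That's all."),
]

print(render(segments))
print(speaking_time(segments))
```

Keeping speaker, timing, and content in one record is what makes downstream steps such as per-speaker summaries or subtitle export straightforward, since no separate diarization pass has to be aligned with the transcript afterwards.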

2. 🎙️ VibeVoice-TTS - Long-form Multi-speaker TTS

Best for: Long-form conversational audio, podcasts, multi-speaker dialogues.

⏱️ 90-minute Long-form Generation: Synthesizes conversational/single-speaker speech up to 90 minutes in a single pass, maintaining speaker consistency and semantic coherence throughout.
👥 Multi-speaker Support: Supports up to 4 distinct speakers in a single conversation, with natural turn-taking and speaker consistency across long dialogues.
🎭 Expressive Speech: Generates expressive, natural-sounding speech that captures conversational dynamics and emotional nuances.
🌐 Multi-lingual Support: Supports English, Chinese and other languages.

3. ⚡ VibeVoice-Streaming - Real-time Streaming TTS

VibeVoice-Realtime is a lightweight real‑time text‑to‑speech model supporting streaming text input and robust long-form speech generation.

Parameter size: 0.5B (deployment-friendly)
Real-time TTS (~300 ms first-audible latency)
Streaming text input
Robust long-form speech generation (~10 minutes)

⚠️ Risks and Limitations

While efforts have been made to optimize the models through various techniques, they may still produce outputs that are unexpected, biased, or inaccurate. VibeVoice inherits any biases, errors, or omissions of its base model (specifically, Qwen2.5-1.5B in this release).

Potential for Deepfakes and Disinformation: High-quality synthetic speech can be misused to create convincing fake audio content for impersonation, fraud, or spreading disinformation. Users must ensure transcripts are reliable, check content accuracy, and avoid using generated content in misleading ways. Users are expected to use the generated content and deploy the models lawfully, in full compliance with all applicable laws and regulations.
