jamiepine / voicebox

Voicebox: The open-source AI voice studio. Clone any voice. Generate speech. Dictate into any app. Talk to agents in voices you own. The full voice I/O stack, running locally on your machine.

voicebox.sh • Docs • Download • Features • API • Troubleshooting

A demo video is available on voicebox.sh.

What is Voicebox?

Voicebox is a local-first AI voice studio: a free and open-source alternative to ElevenLabs and WisprFlow in one app. Clone voices from a few seconds of audio, generate speech in 23 languages across 7 TTS engines, dictate into any text field with a global hotkey, and give any MCP-aware AI agent a voice of your choosing.

The two cloud incumbents sit on opposite halves of the voice I/O loop: ElevenLabs on output, WisprFlow on input. Voicebox does both, bridges them with a bundled local LLM for refinement and per-profile personas, and runs the whole thing on your machine.

Complete privacy — models, voice data, and captures never leave your machine
7 TTS engines — Qwen3-TTS, Qwen CustomVoice, LuxTTS, Chatterbox Multilingual, Chatterbox Turbo, HumeAI TADA, and Kokoro
Voice cloning and preset voices — zero-shot cloning from a reference sample, or 50+ curated preset voices via Kokoro and Qwen CustomVoice
23 languages — from English to Arabic, Japanese, Hindi, Swahili, and more
Post-processing effects — pitch shift, reverb, delay, chorus, compression, and filters
Expressive speech — paralinguistic tags like [laugh], [sigh], [gasp] via Chatterbox Turbo; natural-language delivery control via Qwen CustomVoice
Unlimited length — auto-chunking with crossfade for scripts, articles, and chapters
Stories editor — multi-track timeline for conversations, podcasts, and narratives
Voice input — global dictation hotkey with push-to-talk and toggle modes, accessibility-verified auto-paste on macOS, in-app mic on every text field, Whisper-based STT
Agent voice output — one tool call (voicebox.speak) and any MCP-aware agent (Claude Code, Cursor, Cline) speaks to you in a voice you’ve cloned
Voice personalities — attach a free-form persona to any voice profile, then Compose, Rewrite, or Respond via a bundled local LLM — agents can invoke the same modes over MCP
API-first — REST API plus a built-in MCP server for integrating voice I/O into your own apps and agents
Native performance — built with Tauri (Rust), not Electron
Runs everywhere — macOS (MLX/Metal), Windows (CUDA), Linux, AMD ROCm, Intel Arc, Docker

Download

macOS (Apple Silicon): Download DMG
macOS (Intel): Download DMG
Windows: Download MSI
Docker: docker compose up
View all binaries →
Linux: Pre-built binaries are not yet available. See voicebox.sh/linux-install for build-from-source instructions.

Having trouble? See the Troubleshooting Guide for common install, generation, model-download, and GPU issues.

Features

Multi-Engine Voice Cloning

Seven TTS engines with different strengths, switchable per generation:

Engine	Languages	Strengths
Qwen3-TTS (0.6B / 1.7B)	10	High-quality multilingual cloning, delivery instructions (“speak slowly”, “whisper”)
Qwen CustomVoice	10	9 curated preset voices with natural-language delivery control — no reference audio required
LuxTTS	English	Lightweight (~1GB VRAM), 48kHz output, 150x realtime on CPU
Chatterbox Multilingual	23	Broadest language coverage — Arabic, Danish, Finnish, Greek, Hebrew, Hindi, Malay, Norwegian, Polish, Swahili, Swedish, Turkish and more
Chatterbox Turbo	English	Fast 350M model with paralinguistic emotion/sound tags
TADA (1B / 3B)	10	HumeAI speech-language model — 700s+ coherent audio, text-acoustic dual alignment
Kokoro	8	50 curated preset voices, tiny 82M model, fast CPU inference

Emotions & Paralinguistic Tags

Only Chatterbox Turbo interprets paralinguistic tags like [laugh] and [sigh]. Qwen3-TTS, LuxTTS, Chatterbox Multilingual, and HumeAI TADA read them literally as text. With Chatterbox Turbo selected, type / in the text input to open the tag inserter and add expressive tags inline with speech:

[laugh] [chuckle] [gasp] [cough] [sigh] [groan] [sniff] [shush] [clear throat]
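
Because the other engines read these tags aloud as literal text, a client that switches engines may want to strip them first. The following is a minimal sketch of that idea, not Voicebox's own code: the function name `prepare_text` and the engine identifier `"chatterbox-turbo"` are hypothetical, and only the tag list is taken from above.

```python
import re

# Tags listed above; only Chatterbox Turbo is documented to interpret them.
PARALINGUISTIC_TAGS = (
    "laugh", "chuckle", "gasp", "cough", "sigh",
    "groan", "sniff", "shush", "clear throat",
)

# Matches any listed tag in square brackets, plus whitespace that follows it.
_TAG_RE = re.compile(
    r"\[(?:" + "|".join(re.escape(t) for t in PARALINGUISTIC_TAGS) + r")\]\s*"
)


def prepare_text(text: str, engine: str) -> str:
    """Keep tags for the tag-aware engine; remove them for any other engine."""
    if engine == "chatterbox-turbo":  # hypothetical engine identifier
        return text
    return _TAG_RE.sub("", text).strip()
```

Bracketed text that is not in the tag list, such as `[note]`, is left untouched, so ordinary square brackets in a script survive the cleanup.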

Post-Processing Effects

Eight audio effects powered by Spotify’s pedalboard library. Apply after generation, preview in real time, build reusable presets.

Pitch Shift: Up or down by up to 12 semitones
Reverb: Configurable room size, damping, wet/dry mix
Delay: Echo with adjustable time, feedback, and mix
Chorus / Flanger: Modulated delay for metallic or lush textures
Compressor: Dynamic range compression
Gain: Volume adjustment (-40 to +40 dB)
High-Pass Filter: Remove low frequencies
Low-Pass Filter: Remove high frequencies

Ships with 4 built-in presets (Robotic, Radio, Echo Chamber, Deep Voice) and supports custom presets. Effects can be assigned per-profile as defaults.
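
To illustrate what an effects chain looks like at the library level, here is a rough sketch using pedalboard directly. `PitchShift`, `Reverb`, `Gain`, and `Pedalboard` are real pedalboard classes, but the helper names `clamp` and `build_chain`, the choice of three effects, and the default values are assumptions for illustration. How Voicebox wires its presets internally is not documented here.

```python
def clamp(value: float, low: float, high: float) -> float:
    """Limit a setting to the documented range."""
    return max(low, min(high, value))


def build_chain(pitch_semitones: float = 0.0,
                room_size: float = 0.5,
                gain_db: float = 0.0):
    """Build a three-effect chain, clamped to the ranges listed above."""
    # Imported here so clamp() stays usable without pedalboard installed.
    from pedalboard import Gain, Pedalboard, PitchShift, Reverb

    return Pedalboard([
        PitchShift(semitones=clamp(pitch_semitones, -12, 12)),  # +/- 12 semitones
        Reverb(room_size=clamp(room_size, 0.0, 1.0)),
        Gain(gain_db=clamp(gain_db, -40, 40)),                  # -40 to +40 dB
    ])
```

With pedalboard and NumPy installed, a chain built this way is applied by calling it on a float sample array, for example `processed = build_chain(pitch_semitones=-4)(audio, sample_rate)`.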

Unlimited Generation Length

Text is automatically split at sentence boundaries and each chunk is generated independently, then crossfaded together. Works with all engines.

Configurable auto-chunking limit (100–5,000 chars)
Crossfade slider (0–200ms) for smooth transitions
Max text length: 50,000 characters
Smart splitting respects abbreviations, CJK punctuation, and [tags]
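
The split-then-crossfade approach described above can be sketched in a few lines. This is an illustrative approximation under stated assumptions, not Voicebox's actual splitter: the abbreviation list, the function names, the greedy packing strategy, and the linear fade curve are all hypothetical.

```python
import re

# Small illustrative list; the abbreviations Voicebox recognizes are not documented here.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "e.g.", "i.e.", "etc."}

# Boundary: whitespace after a Latin terminator, or the spot right after a CJK one.
# The listed [tags] contain no terminators, so they are never split in half.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+|(?<=[。！？])")


def split_sentences(text: str) -> list[str]:
    """Split at sentence boundaries, re-joining pieces that end in an abbreviation."""
    sentences, buffer = [], ""
    for piece in _BOUNDARY.split(text):
        piece = piece.strip()
        if not piece:
            continue
        buffer = f"{buffer} {piece}" if buffer else piece
        if buffer.split()[-1].lower() in ABBREVIATIONS:
            continue  # "Dr." does not end a sentence; keep accumulating
        sentences.append(buffer)
        buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences


def chunk_text(text: str, limit: int = 500) -> list[str]:
    """Greedily pack whole sentences into chunks of at most `limit` characters.

    A single sentence longer than `limit` is kept whole in this sketch.
    """
    chunks, current = [], ""
    for sentence in split_sentences(text):
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > limit:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def crossfade(a: list[float], b: list[float], overlap: int) -> list[float]:
    """Join two sample sequences with a linear crossfade over `overlap` samples."""
    overlap = min(overlap, len(a), len(b))
    if overlap == 0:
        return a + b
    start = len(a) - overlap
    mixed = []
    for i in range(overlap):
        weight = (i + 1) / (overlap + 1)  # rises toward 1 across the overlap
        mixed.append(a[start + i] * (1 - weight) + b[i] * weight)
    return a[:start] + mixed + b[overlap:]
```

The overlap is given in samples, so a crossfade setting in milliseconds would be converted with the output sample rate: 50 ms at an assumed 24,000 Hz corresponds to 1,200 samples.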

Generation Versions

Every generation supports multiple versions with provenance tracking:

Original: clean TTS output, always preserved
Effects versions: apply different effects chains from any source version
Takes: regenerate with a new seed for variation
Source tracking: each version records its lineage
Favorites: star generations for quick access
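
One way to picture the source tracking described above is a record that points back to the version it was derived from. The sketch below is a hypothetical data model, not Voicebox's schema: the `Version` fields, the `kind` values, and the `lineage` helper are assumptions made for illustration.

```python
import itertools
from dataclasses import dataclass, field
from typing import Optional

_next_id = itertools.count(1)


@dataclass
class Version:
    kind: str                         # e.g. "original", "effects", or "take"
    source_id: Optional[int] = None   # version this one was derived from, if any
    effects: tuple = ()               # effect names applied on top of the source
    seed: Optional[int] = None        # seed used for a regenerated take
    id: int = field(default_factory=lambda: next(_next_id))


def lineage(version: Version, versions_by_id: dict) -> list:
    """Follow source links back to the root and return ids, oldest first."""
    chain = [version.id]
    while version.source_id is not None:
        version = versions_by_id[version.source_id]
        chain.append(version.id)
    return chain[::-1]
```

Because an effects version only stores a pointer and a list of effects, the original output never has to be modified, which matches the "always preserved" behavior listed above.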

Async Generation Queue

Generation is non-blocking. Submit and immediately start typing the next one.
Serial execution queue prevents GPU contention
Real-time SSE status streaming
Failed generations can be retried
Stale generations from crashes auto-recover on startup
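
The combination of non-blocking submission and serial execution can be sketched with a single worker draining an `asyncio.Queue`. This is a simplified illustration under stated assumptions, not Voicebox's implementation: the job tuples, the status strings, and the `asyncio.sleep(0)` stand-in for a TTS call are hypothetical, and SSE streaming and crash recovery are left out.

```python
import asyncio


async def worker(queue: asyncio.Queue, results: list) -> None:
    """Run queued jobs one at a time so only one generation uses the GPU."""
    while True:
        job_id, text = await queue.get()
        try:
            await asyncio.sleep(0)  # stand-in for the real TTS call
            results.append((job_id, "done"))
        except Exception:
            results.append((job_id, "failed"))  # a failed job could be re-queued
        finally:
            queue.task_done()


async def main() -> list:
    queue: asyncio.Queue = asyncio.Queue()
    results: list = []
    task = asyncio.create_task(worker(queue, results))
    for job_id, text in enumerate(["first line", "second line", "third line"]):
        queue.put_nowait((job_id, text))  # submitting never blocks the caller
    await queue.join()                    # wait until every job has been processed
    task.cancel()
    await asyncio.gather(task, return_exceptions=True)
    return results
```

Running `asyncio.run(main())` processes the three jobs in submission order. Using exactly one worker is what makes execution serial; adding workers would let generations overlap and compete for the GPU.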

Voice Profile Management

Create profiles from audio files or record directly in-app
Import/export profiles to share or back up
Multi-sample support for higher quality cloning
Per-profile default effects chains
Organize with…