jamiepine / voicebox

Voicebox: The open-source AI voice studio. Clone any voice. Generate speech. Dictate into any app. Talk to agents in voices you own. The full voice I/O stack, running locally on your machine.

voicebox.sh • Docs • Download • Features • API • Troubleshooting

Watch the demo video on voicebox.sh.

What is Voicebox?

Voicebox is a local-first AI voice studio: a free and open-source alternative to ElevenLabs and WisprFlow in one app. Clone voices from a few seconds of audio, generate speech in 23 languages across 7 TTS engines, dictate into any text field with a global hotkey, and give any MCP-aware AI agent a voice of your choosing.

The two cloud incumbents sit on opposite halves of the voice I/O loop: ElevenLabs on output, WisprFlow on input. Voicebox does both, bridges them with a bundled local LLM for refinement and per-profile personas, and runs the whole thing on your machine.

  • Complete privacy — models, voice data, and captures never leave your machine

  • 7 TTS engines — Qwen3-TTS, Qwen CustomVoice, LuxTTS, Chatterbox Multilingual, Chatterbox Turbo, HumeAI TADA, and Kokoro

  • Voice cloning and preset voices — zero-shot cloning from a reference sample, or 50+ curated preset voices via Kokoro and Qwen CustomVoice

  • 23 languages — from English to Arabic, Japanese, Hindi, Swahili, and more

  • Post-processing effects — pitch shift, reverb, delay, chorus, compression, and filters

  • Expressive speech — paralinguistic tags like [laugh], [sigh], [gasp] via Chatterbox Turbo; natural-language delivery control via Qwen CustomVoice

  • Unlimited length — auto-chunking with crossfade for scripts, articles, and chapters

  • Stories editor — multi-track timeline for conversations, podcasts, and narratives

  • Voice input — global dictation hotkey with push-to-talk and toggle modes, accessibility-verified auto-paste on macOS, in-app mic on every text field, Whisper-based STT

  • Agent voice output — one tool call (voicebox.speak) and any MCP-aware agent (Claude Code, Cursor, Cline) speaks to you in a voice you’ve cloned

  • Voice personalities — attach a free-form persona to any voice profile, then Compose, Rewrite, or Respond via a bundled local LLM — agents can invoke the same modes over MCP

  • API-first — REST API plus a built-in MCP server for integrating voice I/O into your own apps and agents

  • Native performance — built with Tauri (Rust), not Electron

  • Runs everywhere — macOS (MLX/Metal), Windows (CUDA), Linux, AMD ROCm, Intel Arc, Docker
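The agent voice output bullet above names a single MCP tool, voicebox.speak. As a rough sketch of what such a call could look like on the wire, the snippet below builds a JSON-RPC 2.0 `tools/call` request, which is the standard MCP method for invoking a tool. The argument names `text` and `profile` and the helper `build_speak_request` are assumptions made for illustration; this document does not describe the tool's actual input schema.

```python
import json
from itertools import count

# Monotonic request ids, as JSON-RPC expects each request to carry one.
_request_ids = count(1)


def build_speak_request(text: str, profile: str) -> str:
    """Serialize a JSON-RPC 2.0 message asking an MCP server to run voicebox.speak.

    "tools/call" with params {"name", "arguments"} is the standard MCP shape.
    The "text" and "profile" argument names are hypothetical.
    """
    message = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {
            "name": "voicebox.speak",
            "arguments": {"text": text, "profile": profile},
        },
    }
    return json.dumps(message)


if __name__ == "__main__":
    print(build_speak_request("Build finished.", "my-cloned-voice"))
```

In practice an MCP-aware agent constructs this message itself once the server is registered; the sketch only shows the shape of the call.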


Download

  • macOS (Apple Silicon): Download DMG

  • macOS (Intel): Download DMG

  • Windows: Download MSI

  • Docker: docker compose up

  • View all binaries →

  • Linux: Pre-built binaries are not yet available. See voicebox.sh/linux-install for build-from-source instructions.

Having trouble? See the Troubleshooting Guide for common install, generation, model-download, and GPU issues.


Features: Multi-Engine Voice Cloning

Seven TTS engines with different strengths, switchable per generation:

| Engine | Languages | Strengths |
| --- | --- | --- |
| Qwen3-TTS (0.6B / 1.7B) | 10 | High-quality multilingual cloning, delivery instructions (“speak slowly”, “whisper”) |
| Qwen CustomVoice | 10 | 9 curated preset voices with natural-language delivery control; no reference audio required |
| LuxTTS | English | Lightweight (~1GB VRAM), 48kHz output, 150x realtime on CPU |
| Chatterbox Multilingual | 23 | Broadest language coverage: Arabic, Danish, Finnish, Greek, Hebrew, Hindi, Malay, Norwegian, Polish, Swahili, Swedish, Turkish and more |
| Chatterbox Turbo | English | Fast 350M model with paralinguistic emotion/sound tags |
| TADA (1B / 3B) | 10 | HumeAI speech-language model; 700s+ coherent audio, text-acoustic dual alignment |
| Kokoro | 8 | 50 curated preset voices, tiny 82M model, fast CPU inference |

Emotions & Paralinguistic Tags

Only Chatterbox Turbo interprets paralinguistic tags like [laugh] and [sigh]. Qwen3-TTS, LuxTTS, Chatterbox Multilingual, and HumeAI TADA read them literally as text. With Chatterbox Turbo selected, type / in the text input to open the tag inserter and add expressive tags inline with speech:

[laugh] [chuckle] [gasp] [cough] [sigh] [groan] [sniff] [shush] [clear throat]
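Because only Chatterbox Turbo interprets these tags, a caller that reuses one script across engines may want to strip them for the others so they are not spoken aloud. The following is a minimal sketch of that idea; the function name `prepare_text` and the use of the engine's display name as an identifier are assumptions, not part of Voicebox's documented API.

```python
import re

# Tags taken from the list above.
TURBO_TAGS = (
    "laugh", "chuckle", "gasp", "cough", "sigh",
    "groan", "sniff", "shush", "clear throat",
)

# Match a known tag in square brackets, plus any whitespace just before it.
_TAG_PATTERN = re.compile(
    r"\s*\[(?:" + "|".join(re.escape(tag) for tag in TURBO_TAGS) + r")\]"
)


def prepare_text(text: str, engine: str) -> str:
    """Keep tags for Chatterbox Turbo; remove known tags for any other engine.

    The engine string is a hypothetical identifier used only for this sketch.
    """
    if engine == "Chatterbox Turbo":
        return text
    return _TAG_PATTERN.sub("", text).strip()


if __name__ == "__main__":
    script = "Well [sigh] that was close. [laugh]"
    print(prepare_text(script, "Chatterbox Turbo"))
    print(prepare_text(script, "Kokoro"))
```

Bracketed text that is not in the tag list is left untouched, so ordinary uses of square brackets survive.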


Post-Processing Effects

8 audio effects powered by Spotify’s pedalboard library. Apply after generation, preview in real time, build reusable presets.

  • Pitch Shift: Up or down by up to 12 semitones
  • Reverb: Configurable room size, damping, wet/dry mix
  • Delay: Echo with adjustable time, feedback, and mix
  • Chorus / Flanger: Modulated delay for metallic or lush textures
  • Compressor: Dynamic range compression
  • Gain: Volume adjustment (-40 to +40 dB)
  • High-Pass Filter: Remove low frequencies
  • Low-Pass Filter: Remove high frequencies

Ships with 4 built-in presets (Robotic, Radio, Echo Chamber, Deep Voice) and supports custom presets. Effects can be assigned per-profile as defaults.
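To give the pitch and gain ranges listed above some intuition, the sketch below converts them to the underlying multipliers: a shift of n semitones scales frequency by 2^(n/12), and a gain of g dB scales amplitude by 10^(g/20). The function names are illustrative only and the range checks simply mirror the limits stated in the list; this is not how Voicebox or pedalboard applies the effects internally.

```python
# Ranges taken from the effect list above.
PITCH_RANGE_SEMITONES = (-12.0, 12.0)
GAIN_RANGE_DB = (-40.0, 40.0)


def _check_range(value: float, bounds: tuple, name: str) -> None:
    """Raise ValueError if value lies outside the inclusive bounds."""
    low, high = bounds
    if not low <= value <= high:
        raise ValueError(f"{name} must be between {low} and {high}, got {value}")


def pitch_ratio(semitones: float) -> float:
    """Frequency ratio for an equal-tempered shift: 2 ** (semitones / 12)."""
    _check_range(semitones, PITCH_RANGE_SEMITONES, "semitones")
    return 2.0 ** (semitones / 12.0)


def gain_factor(gain_db: float) -> float:
    """Linear amplitude multiplier for a gain in decibels: 10 ** (dB / 20)."""
    _check_range(gain_db, GAIN_RANGE_DB, "gain_db")
    return 10.0 ** (gain_db / 20.0)


if __name__ == "__main__":
    print(pitch_ratio(12))    # one octave up doubles the frequency
    print(gain_factor(-40))   # -40 dB is one hundredth of the amplitude
```

So the full pitch range spans one octave in each direction, and the gain range spans a factor of 100 in amplitude either way.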


Unlimited Generation Length

Text is automatically split at sentence boundaries and each chunk is generated independently, then crossfaded together. Works with all engines.

  • Configurable auto-chunking limit (100–5,000 chars)
  • Crossfade slider (0–200ms) for smooth transitions
  • Max text length: 50,000 characters
  • Smart splitting respects abbreviations, CJK punctuation, and [tags]
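The split-then-crossfade approach described above can be sketched roughly as follows. This is an illustrative approximation, not Voicebox's actual splitter: the abbreviation list, the function names, and the linear fade curve are all assumptions, and a single sentence longer than the limit is kept as one oversized chunk rather than being split further.

```python
import re

# Split after Western sentence punctuation followed by whitespace,
# or directly after CJK sentence punctuation (no whitespace needed).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+|(?<=[。！？])")

# Small illustrative abbreviation list (assumed, not the app's real list).
_ABBREVIATIONS = {"mr.", "mrs.", "ms.", "dr.", "prof.", "e.g.", "i.e.", "vs."}


def split_sentences(text):
    """Split text into sentences, re-joining pieces cut after an abbreviation."""
    pieces = [p.strip() for p in _SENTENCE_END.split(text) if p.strip()]
    sentences = []
    for piece in pieces:
        if sentences and sentences[-1].split()[-1].lower() in _ABBREVIATIONS:
            sentences[-1] += " " + piece
        else:
            sentences.append(piece)
    return sentences


def chunk_text(text, limit=500):
    """Greedily pack whole sentences into chunks of at most `limit` characters."""
    if not 100 <= limit <= 5000:  # range quoted in the list above
        raise ValueError("limit must be between 100 and 5000 characters")
    chunks, current = [], ""
    for sentence in split_sentences(text):
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > limit:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def crossfade(a, b, overlap):
    """Join two sample lists, linearly blending the last `overlap` samples of a
    into the first `overlap` samples of b."""
    overlap = min(overlap, len(a), len(b))
    if overlap == 0:
        return a + b
    blended = []
    for i in range(overlap):
        weight_b = (i + 1) / (overlap + 1)  # rises from near 0 toward 1
        blended.append(a[len(a) - overlap + i] * (1 - weight_b) + b[i] * weight_b)
    return a[: len(a) - overlap] + blended + b[overlap:]
```

Tags such as [clear throat] contain no sentence punctuation, so this splitter never cuts inside them. A crossfade given in milliseconds converts to samples as `int(sample_rate * ms / 1000)`.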

Generation Versions

Every generation supports multiple versions with provenance tracking:

  • Original: clean TTS output, always preserved
  • Effects versions: apply different effects chains from any source version
  • Takes: regenerate with a new seed for variation
  • Source tracking: each version records its lineage
  • Favorites: star generations for quick access
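One simple way to model the source tracking described above is to have each version store the id of the version it was derived from, then walk those links back to the original. The sketch below assumes that design; the `Version` class, its field names, and the `lineage` helper are hypothetical and do not reflect Voicebox's actual data model.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Version:
    id: str
    kind: str                         # "original", "effects", or "take"
    source_id: Optional[str] = None   # version this one was derived from


def lineage(versions: Dict[str, Version], version_id: str) -> List[str]:
    """Return version ids from the root ancestor down to version_id."""
    chain: List[str] = []
    seen = set()
    current: Optional[str] = version_id
    while current is not None:
        if current in seen:
            raise ValueError("cycle detected in version lineage")
        seen.add(current)
        chain.append(current)
        current = versions[current].source_id
    return list(reversed(chain))


if __name__ == "__main__":
    history = [
        Version("v1", "original"),
        Version("v2", "effects", source_id="v1"),
        Version("v3", "effects", source_id="v2"),
    ]
    by_id = {v.id: v for v in history}
    print(lineage(by_id, "v3"))
```

Keeping the original immutable and deriving every effects version from a recorded source means any chain can be rebuilt from the clean output.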

Async Generation Queue

  • Generation is non-blocking. Submit and immediately start typing the next one.
  • Serial execution queue prevents GPU contention
  • Real-time SSE status streaming
  • Failed generations can be retried
  • Stale generations from crashes auto-recover on startup
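The combination of non-blocking submission and serial execution listed above can be approximated with a single worker draining an `asyncio.Queue`. This is a sketch under assumptions: `run_jobs`, the `generate` callback, and the status names are invented for illustration, and the plain `events` list stands in for whatever the app streams over SSE.

```python
import asyncio


async def _worker(queue, events, generate):
    """Process jobs one at a time so only one generation uses the GPU."""
    while True:
        job_id, text = await queue.get()
        events.append((job_id, "running"))
        try:
            await generate(text)
            events.append((job_id, "done"))
        except Exception:
            # A failed job is only recorded; the caller may resubmit it.
            events.append((job_id, "failed"))
        finally:
            queue.task_done()


async def run_jobs(jobs, generate):
    """Queue every job immediately, then wait for the single worker to finish."""
    queue = asyncio.Queue()
    events = []
    worker = asyncio.create_task(_worker(queue, events, generate))
    for job_id, text in jobs:
        queue.put_nowait((job_id, text))  # returns at once; nothing blocks here
        events.append((job_id, "queued"))
    await queue.join()
    worker.cancel()
    try:
        await worker
    except asyncio.CancelledError:
        pass
    return events


async def fake_generate(text):
    """Stand-in for a real TTS call; fails on the input "bad"."""
    await asyncio.sleep(0)
    if text == "bad":
        raise RuntimeError("generation failed")


if __name__ == "__main__":
    for event in asyncio.run(run_jobs([(1, "hello"), (2, "bad")], fake_generate)):
        print(event)
```

Because there is exactly one worker, jobs run strictly in submission order, and a failure in one job does not stop the ones behind it.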

Voice Profile Management

  • Create profiles from audio files or record directly in-app
  • Import/export profiles to share or back up
  • Multi-sample support for higher quality cloning
  • Per-profile default effects chains
  • Organize with…