Four free neural TTS options for CI pipelines — edge-tts, Kokoro, MeloTTS, Bark

Building a two-host video pipeline put me through most of the free neural TTS options that can run in GitHub Actions without a GPU. My criteria: zero API cost, acceptable voice quality, headless operation in CI, and no CUDA requirement at inference time. Here’s a comparison of the four I tested or seriously evaluated.

edge-tts (what I’m using)

GitHub: rany2/edge-tts | License: MIT (wrapper) | Voices: 400+ across 100+ languages

edge-tts is a Python wrapper around Microsoft Edge’s read-aloud TTS endpoint, the same one that fires when you right-click text in Edge and select “Read aloud.” It streams MP3 output. Quality on the en-US-GuyNeural and en-US-AvaNeural voices is genuinely broadcast-quality: noticeably better than older open-source models and competitive with commercial APIs. It is also fast, because synthesis happens on the remote endpoint and the runner only receives the stream: a 10-minute audio file generates in 30-60 seconds regardless of CI runner hardware.

The catch: it calls an unofficial Microsoft endpoint. Microsoft hasn’t published a public contract for it and could restrict access without warning. I’ve been running it daily for about a month without issues, but this is a real operational risk.

    pip install edge-tts
    edge-tts --voice en-US-GuyNeural --text "Hello world" --write-media out.mp3
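Since the endpoint is unofficial and can fail without notice, it may be worth wrapping the call in a retry so that a transient error does not fail the whole CI job. The sketch below is one way to do that, not the only one. `synthesize_with_retry` and `edge_synth` are hypothetical helper names, and the attempt count and delays are arbitrary choices. The only library call used is `edge_tts.Communicate(text, voice).save(path)`, the package's async Python entry point.

```python
import asyncio
import time


def synthesize_with_retry(synth, text, out_path, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call synth(text, out_path); on failure, retry with exponential backoff.

    Returns the number of attempts used. Raises RuntimeError once all attempts fail.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            synth(text, out_path)
            return attempt + 1
        except Exception as exc:  # network errors, rejected requests, etc.
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 2s, then 4s with the defaults
    raise RuntimeError(f"TTS failed after {attempts} attempts") from last_error


def edge_synth(text, out_path, voice="en-US-GuyNeural"):
    """Render text to an MP3 file through edge-tts (hypothetical wrapper)."""
    import edge_tts  # lazy import: the retry helper itself needs only the stdlib

    asyncio.run(edge_tts.Communicate(text, voice).save(str(out_path)))
```

Calling `synthesize_with_retry(edge_synth, "Hello world", "out.mp3")` would then make up to three attempts before the job fails. A retry only covers transient failures; if Microsoft restricts the endpoint outright, every attempt fails and the job still stops.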

Best for: CI pipelines where voice quality matters and you can accept an external unofficial API dependency.


Kokoro-82M

HuggingFace: hexgrad/Kokoro-82M | License: Apache 2.0 | Params: 82M

Kokoro is a small TTS model that runs entirely locally. Voice quality is good for the model size: noticeably better than older models like Tacotron2 and FastSpeech2, though below edge-tts on naturalness for longer passages. The main tradeoff for CI is speed. On a standard GitHub Actions runner, CPU inference runs well below real time, so a 10-minute audio job can take significantly longer than 10 minutes to render, depending on segment count and text density. For short-form content (under 3 minutes) this is usually fine; for longer videos it becomes the bottleneck.

The first run downloads about 320MB of model weights. If you cache these in GitHub Actions, subsequent runs skip the download.

    from kokoro import KPipeline

    pipeline = KPipeline(lang_code="a")  # "a" = American English
    # Each yielded chunk unpacks to (graphemes, phonemes, audio); audio is 24 kHz
    _, _, audio = next(pipeline("Hello world", voice="af_heart"))
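Whether a local model fits a job's time budget depends on its real-time factor on your runner, so it can help to measure that once instead of guessing. The helpers below are a minimal sketch with hypothetical names. `synth` stands for any function that returns the rendered samples for a text (for Kokoro, a wrapper around an already-loaded pipeline, with a sample rate of 24000 as far as I know). The clock is injectable only to make the function easy to test.

```python
import time


def realtime_factor(synth, text, sample_rate, clock=time.perf_counter):
    """Return seconds of audio produced per wall-clock second (1.0 = real time).

    synth(text) must return a sequence of audio samples, e.g. a NumPy array.
    """
    start = clock()
    samples = synth(text)
    elapsed = clock() - start
    return (len(samples) / sample_rate) / elapsed


def projected_render_minutes(audio_minutes, factor):
    """Estimate wall-clock minutes needed to render audio_minutes of speech."""
    return audio_minutes / factor
```

As a purely illustrative number: a measured factor of 0.5 would mean 10 minutes of audio takes about 20 minutes to render. Load the model before timing, otherwise the weight download and initialization end up counted as inference time.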

Best for: fully local inference with no external API calls, and projects that need auditable, offline-capable TTS.


MeloTTS

GitHub: myshell-ai/MeloTTS | License: MIT | Languages: English, Chinese, Japanese, Korean, French, Spanish

MeloTTS from MyShell.ai is a multilingual model with better-than-average English naturalness in my testing. The Python package is melo-tts (pip), and the API lets you set speaker ID and speed per utterance without reloading the model between clips, which is useful when you’re rendering hundreds of short dialogue segments in a batch. CPU inference speed is in the same range as Kokoro. Model download is around 500MB. The MIT license is a practical advantage if you’re building a product on top of it: there are no Apache license compatibility questions to work through.

    from melo.api import TTS

    tts = TTS(language="EN", device="cpu")
    tts.tts_to_file("Hello world", tts.hps.data.spk2id["EN-Default"], "out.wav")
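The batch case mentioned above, many short clips with a different speaker or speed each, could look roughly like this. It is a sketch under assumptions: `render_dialogue` and the file naming scheme are hypothetical, and the speaker keys in the usage comment (`EN-US`, `EN-BR`) are the ones I believe ship with the English model, so check `tts.hps.data.spk2id` on your install. The only MeloTTS call is `tts_to_file` with its `speed` keyword.

```python
from pathlib import Path


def render_dialogue(tts, segments, out_dir):
    """Render (speaker_name, text, speed) segments to numbered WAV files.

    tts is an already-loaded melo.api.TTS instance, so the model is loaded
    once and reused for every clip.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    speaker_ids = tts.hps.data.spk2id
    paths = []
    for index, (speaker, text, speed) in enumerate(segments):
        path = out_dir / f"{index:04d}_{speaker}.wav"
        tts.tts_to_file(text, speaker_ids[speaker], str(path), speed=speed)
        paths.append(path)
    return paths


# Usage (requires MeloTTS to be installed):
#   from melo.api import TTS
#   tts = TTS(language="EN", device="cpu")
#   render_dialogue(tts, [("EN-US", "Welcome back.", 1.0),
#                         ("EN-BR", "Thanks for having me.", 1.1)], "clips")
```

Zero-padded indices keep the clips in script order when they are later concatenated or passed to a video tool.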

Best for: multilingual content pipelines, or when you want MIT-licensed local TTS with solid English quality.


Bark by Suno

GitHub: suno-ai/bark | License: MIT | Size: ~1.7GB (small), ~8GB (large)

Bark is the most capable of the four for voice expressiveness. You can specify laughter ([laughs]), sighs, hesitations, and non-speech sounds inline in the prompt text. Quality on the large model is competitive with commercial TTS APIs. The problem for standard CI: the large model needs a GPU with substantial VRAM, and on CPU it takes minutes to render 30 seconds of audio. The small model fits in RAM, but quality drops noticeably. Standard GitHub Actions runners have no GPU, which makes the large model impractical and the small model a significant quality downgrade.
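For a sense of how the inline cues are used, here is a hedged sketch. The cue list is the one I recall from the Bark README and may be incomplete, so treat `KNOWN_CUES` as an assumption. `unknown_cues` and `render_bark` are hypothetical helper names. `SUNO_USE_SMALL_MODELS` is the environment variable Bark reads at import time to select the small checkpoints.

```python
import os
import re

# Cues recalled from the Bark README; an assumption, not an exhaustive list
KNOWN_CUES = {"[laughter]", "[laughs]", "[sighs]", "[music]", "[gasps]", "[clears throat]"}


def unknown_cues(prompt):
    """Return bracketed cues in prompt that are not in KNOWN_CUES, in order."""
    found = re.findall(r"\[[^\]]+\]", prompt)
    return [cue for cue in found if cue.lower() not in KNOWN_CUES]


def render_bark(prompt, use_small_models=True):
    """Generate audio for prompt with Bark; returns (samples, sample_rate)."""
    if use_small_models:
        # Must be set before bark is imported, because it is read at import time
        os.environ["SUNO_USE_SMALL_MODELS"] = "True"
    from bark import SAMPLE_RATE, generate_audio, preload_models  # lazy import

    preload_models()
    return generate_audio(prompt), SAMPLE_RATE
```

Checking a script with `unknown_cues` before rendering is cheap insurance on a slow model: a mistyped cue costs minutes of CPU time before you hear that it was ignored or read aloud.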

Best for: local GPU inference where expressive voice effects justify the hardware requirement. Not practical for standard CPU-only CI runners.


Comparison

Tool         | Voice quality | CPU speed        | External API     | CI practical
edge-tts     | excellent     | fast (streaming) | yes (unofficial) | yes
Kokoro-82M   | good          | slow             | no               | yes (short video)
MeloTTS      | good          | slow             | no               | yes (short video)
Bark (large) | excellent     | very slow        | no               | no

For automated video pipelines on standard GitHub Actions runners, edge-tts is the practical choice if you accept the unofficial API dependency. If you need fully local inference and your videos stay under 3-4 minutes, Kokoro or MeloTTS both work within a reasonable job time budget. Bark belongs on a GPU machine, not a free CI runner.
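The two recommendations can also be combined: use edge-tts while the endpoint works and fall back to a local model when it does not. The following is a sketch of that idea, not something the pipeline described here is confirmed to do. `synthesize_with_fallback`, `edge_backend`, and `melo_backend` are hypothetical names, and each backend returns the path it wrote because the two tools produce different formats (MP3 versus WAV).

```python
def synthesize_with_fallback(backends, text, out_stem):
    """Try each (name, synth) pair in order; return (name, path) from the first success."""
    errors = []
    for name, synth in backends:
        try:
            return name, synth(text, out_stem)
        except Exception as exc:  # any backend failure moves on to the next one
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all TTS backends failed: " + "; ".join(errors))


def edge_backend(text, out_stem, voice="en-US-GuyNeural"):
    """Remote backend: edge-tts, writes MP3."""
    import asyncio
    import edge_tts  # lazy import so the fallback logic has no hard dependency

    path = out_stem + ".mp3"
    asyncio.run(edge_tts.Communicate(text, voice).save(path))
    return path


def melo_backend(text, out_stem):
    """Local backend: MeloTTS on CPU, writes WAV."""
    from melo.api import TTS  # lazy import

    path = out_stem + ".wav"
    tts = TTS(language="EN", device="cpu")
    tts.tts_to_file(text, tts.hps.data.spk2id["EN-Default"], path)
    return path
```

With `synthesize_with_fallback([("edge-tts", edge_backend), ("melo", melo_backend)], text, "clip")`, the returned name tells you which voice ended up in the video. That matters for a two-host format, since a mid-episode switch between backends would be audible.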