Reachy Mini goes fully local
Reachy Mini goes fully local
Reachy Mini 实现完全本地化
After building your Reachy Mini, you’ll install the conversation app and start talking to it. Until now, you had to send your audio to a server. But not anymore. Today we’ll walk you through running the whole stack locally. This stack is powered by speech-to-speech, our cascaded VAD → STT → LLM → TTS pipeline that exposes a Realtime API-compatible /v1/realtime WebSocket. Once you launch the backend, point the robot at it from the UI.
在组装好 Reachy Mini 后,你需要安装对话应用程序并开始与它交流。在此之前,你必须将音频发送到服务器才能实现对话。但现在情况不同了。今天,我们将指导你如何在本地运行整个技术栈。该技术栈由我们的 speech-to-speech 项目驱动,这是一个级联的 VAD(语音活动检测)→ STT(语音转文字)→ LLM(大语言模型)→ TTS(文字转语音)流水线,它提供了一个兼容 Realtime API 的 /v1/realtime WebSocket 接口。一旦启动后端,只需在 UI 中将机器人指向该地址即可。
Cascades are the most flexible option in the open-source landscape today, and with the right pieces they’re also the fastest. We’ll recommend the components we like best, but the whole point of a cascade is that you can swap them. New models drop every week.
级联架构是目前开源领域中最灵活的选择,如果配置得当,它也是速度最快的。我们将推荐我们最喜欢的组件,但级联架构的核心优势在于你可以随时替换它们。毕竟,每周都有新的模型发布。
TL;DR
简而言之
Deploy a local speech backend for your Reachy Mini. We use our speech-to-speech library, a cascade approach. Recommended: llama.cpp with Gemma 4, Silero VAD, Parakeet-TDT 0.6B v3 STT, Qwen3-TTS.
为你的 Reachy Mini 部署一个本地语音后端。我们使用自己的 speech-to-speech 库,这是一种级联方案。推荐配置:运行 Gemma 4 的 llama.cpp、Silero VAD、Parakeet-TDT 0.6B v3 STT 以及 Qwen3-TTS。
Quick start
快速入门
This blog walks you through running conversations with Reachy Mini fully locally. No cloud, no API keys, no data leaving your machine.
本博客将引导你实现 Reachy Mini 的完全本地化对话。无需云端,无需 API 密钥,数据绝不会离开你的设备。
Locally serving the LLM
本地部署 LLM
To serve the LLM, we’ll use Hugging Face’s llama.cpp. If you need to install it, the simplest way is brew install llama.cpp or winget install llama.cpp, for more help, check the docs. First, we’ll run:
为了部署 LLM,我们将使用 Hugging Face 的 llama.cpp。如果你需要安装它,最简单的方法是使用 brew install llama.cpp 或 winget install llama.cpp,更多帮助请查阅文档。首先,我们运行:
llama-server -hf ggml-org/gemma-4-E4B-it-GGUF -np 2 -c 65536 -fa on --swa-full
And done! The first time it will download the model, subsequent launches are fast. What do those flags do?
搞定!首次运行时它会下载模型,后续启动速度会很快。这些参数有什么作用?
-
-hf ggml-org/gemma-4-E4B-it-GGUF— pulls the model straight from the Hub. First run downloads it, subsequent runs use the cache. -
-np 2— two parallel slots. Lets the server handle a second request (e.g. a quick interruption) without blocking on the first. -
-c 65536— 64k context window, shared across slots. Plenty of headroom for long conversations. -
-fa on— flash attention. Faster and lower memory, basically free on modern hardware. -
--swa-full— keeps the full sliding-window attention cache instead of recomputing it. Trades a bit of RAM for noticeably faster prompt processing on Gemma. -
-hf ggml-org/gemma-4-E4B-it-GGUF:直接从 Hugging Face Hub 拉取模型。首次运行下载,后续运行使用缓存。 -
-np 2:两个并行槽位。允许服务器处理第二个请求(例如快速打断),而不会阻塞第一个请求。 -
-c 65536:64k 上下文窗口,在槽位间共享。为长对话提供了充足的空间。 -
-fa on:Flash Attention。速度更快且内存占用更低,在现代硬件上几乎是“免费”的性能提升。 -
--swa-full:保留完整的滑动窗口注意力缓存,而不是重新计算。以少量内存为代价,显著加快了 Gemma 的提示词处理速度。
Setting up speech-to-speech
设置 speech-to-speech
We’ll begin by simply installing the library:
uv pip install speech-to-speech
我们首先安装该库:
uv pip install speech-to-speech
Then, while we are serving the LLM in another terminal, we can simply run:
speech-to-speech --responses_api_base_url "http://127.0.0.1:8080" --responses_api_api_key "" --mode local
然后,在另一个终端运行 LLM 服务的同时,我们只需运行:
speech-to-speech --responses_api_base_url "http://127.0.0.1:8080" --responses_api_api_key "" --mode local
And you can start talking to the model through your terminal! The first time it will need to download Parakeet-TDT 0.6B v3 and Qwen3TTS, but subsequent launches are fast.
现在你就可以通过终端与模型对话了!首次运行时需要下载 Parakeet-TDT 0.6B v3 和 Qwen3TTS,但后续启动会很快。
Going deeper
深入了解
Why run your own Speech-to-Speech server? Hosted realtime backends are convenient, but running your own engine unlocks three things:
- Privacy. Audio never leaves your network, the entire pipeline runs on hardware you control.
- No API costs. No per-minute or per-token fees.
- Full control over the pipeline. Swap any piece: VAD, STT, LLM, TTS. Whenever something better lands on the Hub 🤗.
为什么要运行自己的 Speech-to-Speech 服务器?托管的实时后端虽然方便,但运行自己的引擎有三个好处:
- 隐私性:音频绝不会离开你的网络,整个流水线都在你控制的硬件上运行。
- 无 API 成本:没有按分钟或按 Token 收取的费用。
- 对流水线的完全控制:可以替换任何组件(VAD、STT、LLM、TTS),只要 Hub 上有更好的模型出现,你就能随时更换 🤗。