Hugging Face and Cerebras bring Gemma 4 to real-time voice AI

For voice AI, latency is a critical parameter. Developers have made tremendous progress in model quality, but the user experience is still often limited by response times. Hugging Face and Cerebras are changing that experience. Today, we demonstrate what becomes possible when an open, modular voice AI architecture is paired with industry-leading inference speed. The result is a speech-to-speech experience that feels dramatically more natural. Instead of waiting for an AI to respond, conversations flow with the responsiveness users expect from human interaction.

Architecture: an open, cascaded speech-to-speech stack

The demo is built as a real-time speech-to-speech pipeline. Each part of the system is modular, open, and replaceable, so developers can adapt the stack for different assistants, robots, products, or research projects. The result is a fully open speech-to-speech loop: speech input -> speech recognition with Nvidia’s Parakeet -> Gemma 4 VLM inference on Cerebras -> text-to-speech with Alibaba’s Qwen3TTS -> spoken response. The architecture brings together the strengths of the open-source AI ecosystem: Nvidia’s Parakeet for speech recognition, Google DeepMind’s Gemma 4 31B for the language model, Cerebras for fast inference, and Qwen for text-to-speech. Every layer can be inspected, modified, and extended by developers.
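
The cascaded loop can be sketched as three swappable stages. This is a minimal illustration, not the demo's actual code: the `CascadedPipeline` class, the stage type aliases, and the `fake_*` stand-ins are hypothetical names chosen for this example. A real deployment would replace each stand-in with a call to the corresponding model (speech recognition, the language model served on Cerebras, and text-to-speech).

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

# Hypothetical stage signatures: each stage is a plain callable,
# so any one of them can be replaced without touching the others.
Transcriber = Callable[[bytes], str]   # audio in -> user text
Responder = Callable[[str], str]       # user text -> reply text
Synthesizer = Callable[[str], bytes]   # reply text -> audio out


@dataclass
class CascadedPipeline:
    transcribe: Transcriber
    respond: Responder
    synthesize: Synthesizer

    def run_turn(self, audio_in: bytes) -> Tuple[bytes, Dict[str, float]]:
        """Run one conversational turn and record per-stage latency in seconds."""
        timings: Dict[str, float] = {}

        start = time.perf_counter()
        user_text = self.transcribe(audio_in)
        timings["asr"] = time.perf_counter() - start

        start = time.perf_counter()
        reply_text = self.respond(user_text)
        timings["llm"] = time.perf_counter() - start

        start = time.perf_counter()
        audio_out = self.synthesize(reply_text)
        timings["tts"] = time.perf_counter() - start

        return audio_out, timings


# Stand-in stages (assumptions for illustration only). They treat the
# "audio" as UTF-8 bytes so the example runs without any model.
def fake_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = CascadedPipeline(fake_asr, fake_llm, fake_tts)
audio_out, timings = pipeline.run_turn(b"hello")
print(audio_out, sorted(timings))
```

Because each stage is only a callable, swapping a model means passing a different function, and the per-stage timings show where a turn spends its time.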

Cerebras and Hugging Face partnership

Today, some production systems achieve a reasonable median latency while still suffering frustrating multi-second delays at the 95th percentile (P95). Those delays become even more noticeable when tool calls or multimodal steps require multiple turns. Cerebras helps solve one of the most important bottlenecks in the stack: the language-model response time. By making inference dramatically faster and more stable, Cerebras allows the rest of the Hugging Face pipeline to shine. That stability matters most in the long tail, because even occasional slow responses make a conversation feel unreliable.
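
To make the median-versus-P95 gap concrete, here is a small sketch that computes both from a list of per-turn latencies. The latency values are made up for illustration and are not measurements from the demo. The `percentile` helper is a hypothetical function written for this example, using the nearest-rank definition of a percentile.

```python
import math
import statistics
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range 1..100")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Made-up per-turn latencies in seconds: most turns are fast,
# but two out of twenty take several seconds.
latencies_s = [0.4] * 18 + [3.0] * 2

median_s = statistics.median(latencies_s)
p95_s = percentile(latencies_s, 95)
print(f"median={median_s:.1f}s p95={p95_s:.1f}s")
```

In this illustrative sample the median looks healthy while the P95 exposes the slow turns, which is why tail latency is worth tracking separately from the median.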

Built for real-world interaction

This same Hugging Face speech-to-speech pipeline already powers Reachy Mini robots, with more than 9,000 robots in the wild. For robots, voice assistants, and embodied AI, responsiveness is not a cosmetic improvement. It is what makes the interaction feel alive. The motivation to use Cerebras is therefore not simply cost reduction. It is low latency, predictable performance, and the ability to create real-time experiences that feel natural at scale. This collaboration reflects a shared belief that the future of AI will be both open and performant. Open-source models, open infrastructure, and breakthrough inference speed together create a foundation for the next generation of conversational AI. We invite developers to explore the demo, experiment with the code, and help shape what comes next for real-time voice AI.
