DiffusionGemma: 4x Faster Text Generation

Our newest open experimental model delivers up to 4x faster inference on dedicated GPUs and opens the door to exploring speed-critical, interactive local workflows.

Today, we’re introducing DiffusionGemma, an experimental open model that explores text diffusion, an exceptionally fast approach to text generation. Released under an Apache 2.0 license, this 26B Mixture of Experts (MoE) model moves beyond the sequential token-by-token processing of typical autoregressive Large Language Models (LLMs). Instead, it generates entire blocks of text simultaneously, delivering up to 4x faster text generation on GPUs.

Built upon the industry-leading intelligence-per-parameter of our Gemma 4 family and cutting-edge Gemini Diffusion research, DiffusionGemma integrates a novel diffusion head designed to maximize generation speed. While autoregressive Gemma 4 models remain the standard for high-quality production outputs, DiffusionGemma is designed for researchers and developers exploring speed-critical, interactive local workflows such as in-line editing, rapid iteration, and generating non-linear text structures.

Unlocking new value for developers

Developers building real-time interactive AI applications often struggle with the latency bottlenecks of local inference. DiffusionGemma addresses these challenges directly, with some key trade-offs:

  • Blazing-fast inference: By shifting the decode bottleneck from memory bandwidth to compute, DiffusionGemma generates tokens up to 4x faster on dedicated GPUs (1000+ tokens per second on a single NVIDIA H100, 700+ tokens per second on an NVIDIA GeForce RTX 5090).

  • Accessible hardware footprint: As a Mixture of Experts (MoE) model with 26B total parameters that activates only 3.8B of them during inference, DiffusionGemma fits comfortably within the 18GB VRAM limits of high-end dedicated consumer GPUs when quantized.

  • Bi-directional attention: Generating 256 tokens in parallel with each forward pass allows every token to attend to all others. This provides significant advantages for non-linear domains such as in-line editing, code infilling, amino acid sequences or mathematical graphs.

  • Intelligent self-correction: The model iteratively refines its own output, evaluating the entire text block at once so it can fix mistakes in real time.

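
The hardware-footprint claim in the list above can be sanity-checked with back-of-envelope arithmetic. The sketch below estimates weight memory only, for a few bit widths. The post does not say which quantization format is used, so the bit widths here are assumptions, and the helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(total_params, bits_per_param):
    """Approximate memory needed to hold the weights alone, in GB (1 GB = 1e9 bytes)."""
    return total_params * bits_per_param / 8 / 1e9


TOTAL_PARAMS = 26e9  # total parameter count stated in the post

# Bit widths are illustrative assumptions, not a published quantization recipe.
for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_memory_gb(TOTAL_PARAMS, bits):.1f} GB")
```

Under these assumptions, only the 4-bit case leaves headroom inside an 18GB budget for activations and runtime overhead, which this estimate does not count. Note that all 26B parameters must be resident in memory even though only 3.8B are active per token.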

Experimental status & production recommendations

Because it prioritizes speed and parallel layout generation, DiffusionGemma’s overall output quality is lower than standard Gemma 4. For applications that demand maximum quality, we recommend deploying standard Gemma 4. You can improve DiffusionGemma’s performance on specific tasks through fine-tuning.

Why diffusion for text?

While the AI research community has explored diffusion-based text generation for years, applying it to large models has remained a challenge. DiffusionGemma changes this by shifting how models use hardware.

The trade-off with traditional models: Most language models act like a typewriter, generating one token at a time from left to right. In the cloud, this is efficient because servers can batch thousands of user requests together to share the hardware load. But when run locally for a single user, this token-by-token process leaves your dedicated GPU or TPU underutilized: it spends most of its time simply waiting for the next “keystroke.”

DiffusionGemma reverses this inefficiency. Instead of predicting tokens one at a time, it drafts an entire 256-token block at once. By giving the processor a larger chunk of work per pass, DiffusionGemma utilizes your hardware to its full potential. It upgrades your model inference from a single, sequential typewriter to a massive printing press that stamps the entire block of text at once.
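
The typewriter-versus-printing-press argument can be made concrete with a rough memory-bandwidth bound. The sketch below assumes each forward pass must stream the weights from GPU memory once, and asks how many tokens per second that ceiling allows. The function name, the byte count, the bandwidth, and the number of refinement passes per block are all hypothetical values chosen for illustration, not figures from the post.

```python
def memory_bound_tokens_per_s(bytes_read_per_pass, bandwidth_bytes_per_s,
                              tokens_per_block, passes_per_block):
    """Upper bound on tokens/s if each forward pass streams the weights once.

    Ignores compute time entirely, so it only describes the memory-bound regime.
    """
    seconds_per_pass = bytes_read_per_pass / bandwidth_bytes_per_s
    seconds_per_block = passes_per_block * seconds_per_pass
    return tokens_per_block / seconds_per_block


# All values below are illustrative assumptions, not measured or published figures.
BYTES_PER_PASS = 2e9   # hypothetical: weight bytes touched in one forward pass
BANDWIDTH = 1e12       # hypothetical: 1 TB/s of GPU memory bandwidth

# Autoregressive decoding: one new token per pass.
autoregressive = memory_bound_tokens_per_s(BYTES_PER_PASS, BANDWIDTH, 1, 1)

# Block generation: 256 tokens per block, refined over an assumed 16 passes.
block_parallel = memory_bound_tokens_per_s(BYTES_PER_PASS, BANDWIDTH, 256, 16)

print(f"autoregressive bound: {autoregressive:,.0f} tokens/s")
print(f"block-parallel bound: {block_parallel:,.0f} tokens/s")
print(f"ratio: {block_parallel / autoregressive:.0f}x")
```

This is a ceiling, not a prediction. In a Mixture of Experts model, a 256-token block is likely to route to more experts than a single token does, so the bytes read per pass would be larger for the block case than this sketch assumes. Once the memory ceiling rises this far, compute becomes the limiting factor, which matches the post's description of the bottleneck moving from memory bandwidth to compute.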

How text diffusion works

Much as AI image generators start with visual static and iteratively refine it into a clear picture, DiffusionGemma applies the same idea to text:

  1. The canvas: The model starts with a canvas of random placeholder tokens.
  2. Iterative refinement: The model makes multiple passes, locking in correct tokens and using them as context clues to refine the rest.
  3. Final polish: The text converges into high-quality output.
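
The three steps above can be sketched as a toy loop. This is not DiffusionGemma's algorithm: `toy_denoiser` is a hypothetical stand-in that peeks at a known target and attaches a random confidence to each proposal, whereas a real model would predict both from the canvas. The names `VOCAB`, `generate_block`, and `tokens_per_pass` are likewise assumptions made for illustration.

```python
import random

VOCAB = ["the", "cat", "sat", "on", "a", "mat", "and", "purred"]


def toy_denoiser(canvas, locked, target, rng):
    """Hypothetical stand-in for one forward pass over the whole block.

    Returns a (token, confidence) proposal for every position at once.
    """
    proposals = []
    for i in range(len(canvas)):
        if locked[i]:
            proposals.append((canvas[i], 1.0))        # already committed
        else:
            proposals.append((target[i], rng.random()))
    return proposals


def generate_block(target, tokens_per_pass=2, seed=0):
    rng = random.Random(seed)
    n = len(target)
    canvas = [rng.choice(VOCAB) for _ in range(n)]    # 1. random canvas
    locked = [False] * n
    passes = 0
    while not all(locked):                            # 2. iterative refinement
        proposals = toy_denoiser(canvas, locked, target, rng)
        open_positions = [i for i in range(n) if not locked[i]]
        # Lock in the most confident proposals; the rest are retried next pass.
        open_positions.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in open_positions[:tokens_per_pass]:
            canvas[i] = proposals[i][0]
            locked[i] = True
        passes += 1
    return canvas, passes                             # 3. converged block


block, passes = generate_block(["the", "cat", "sat", "on", "a", "mat"])
print(" ".join(block), f"({passes} passes)")
```

The point of the sketch is the control flow: every pass looks at the whole block, commits only the positions it is most sure of, and reuses those as fixed context for the next pass. The number of passes is therefore smaller than the number of tokens.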

Because the model can process the whole paragraph while generating, it unlocks new patterns of model behavior, such as correctly closing complex Markdown formatting or generating and rendering code in near real time.

Get started today

  • Download the weights: Access the experimental model weights (released under a permissive Apache 2.0 license) right now on Hugging Face.
  • Integrate & learn: Learn more in our DiffusionGemma developer guide, or dive into A Visual Guide to DiffusionGemma to understand the mechanics under the hood.
  • Use your favorite development tools: Serve the model efficiently using MLX, vLLM (with integration supported by Red Hat), and Hugging Face Transformers. For rapid experimentation, we are releasing a fine-tuning tutorial using Hackable Diffusion, a modular JAX toolbox designed for composability.