Google DeepMind releases DiffusionGemma, a model that runs local AI 4x faster

Another day, another AI model from Google. This time, Google DeepMind has released a new member of the Gemma 4 open model family, but it’s fundamentally different from the rest of the lineup. DiffusionGemma doesn’t generate outputs linearly like most AI models. Instead, it can produce an entire block of text in parallel. Google says this makes it faster and more efficient when running on local hardware like an Nvidia DGX or a humble gaming GPU.

Most AI models are designed to be autoregressive: they generate text left to right, one token at a time. DiffusionGemma has more in common with image generation models, which start with static and then denoise it to create the desired content. This model starts with a field of placeholder tokens and runs over that canvas multiple times, generating likely tokens and using them to improve its estimates of the others. At the end of the process, the model finalizes its token outputs in one large block, the “denoised” text canvas.
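
To make the denoising loop a little more concrete, here is a minimal toy sketch in Python. It is not Google's implementation: `toy_predict` is a hypothetical stand-in for the model that simply reads the answer from a reference sentence, and the confidence formula is invented for illustration. The only part that mirrors the description above is the control flow: score every placeholder in one pass, commit the most confident guesses, and repeat.

```python
MASK = "<mask>"

def toy_predict(canvas, pos, reference):
    """Hypothetical stand-in for the model: returns (token, confidence).

    A real model would predict from learned probabilities. This toy reads the
    answer from `reference` and only fakes a confidence score that rises as
    neighbouring positions get filled in.
    """
    neighbours = [p for p in (pos - 1, pos + 1) if 0 <= p < len(canvas)]
    filled = sum(canvas[p] != MASK for p in neighbours)
    confidence = 0.3 + 0.3 * filled + 0.01 * (pos % 3)
    return reference[pos], confidence

def denoise(reference, passes):
    """Fill a fully masked canvas in at most `passes` passes."""
    canvas = [MASK] * len(reference)
    history = []
    for step in range(passes):
        masked = [i for i, tok in enumerate(canvas) if tok == MASK]
        if not masked:
            break
        # Score every masked position in the same pass (the "parallel" part).
        proposals = {i: toy_predict(canvas, i, reference) for i in masked}
        # Commit an equal share of the remaining placeholders on each pass.
        remaining_passes = passes - step
        k = -(-len(masked) // remaining_passes)  # ceiling division
        best = sorted(masked, key=lambda i: proposals[i][1], reverse=True)[:k]
        for i in best:
            canvas[i] = proposals[i][0]
        history.append(list(canvas))
    return canvas, history

if __name__ == "__main__":
    sentence = "the cat sat on the mat today ok".split()
    final, history = denoise(sentence, passes=4)
    for snapshot in history:
        print(" ".join(snapshot))
```

One simplification worth flagging: this sketch only ever commits tokens, whereas a diffusion model can also revise earlier guesses on later passes.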

DiffusionGemma is fairly large in the realm of Google’s open models. It’s a Mixture of Experts (MoE) model with a total of 26 billion parameters, but only 3.8 billion are activated during inference. It should fit in the 18GB RAM allotment of a high-end GPU, though that footprint is set by the full 26 billion parameters, presumably quantized, rather than by the active subset, since every expert has to stay loaded. In testing with an RTX 5090, DiffusionGemma spits out around 700 tokens per second. With a single Nvidia H100 AI accelerator, DiffusionGemma can produce 1,000+ tokens per second. That’s about four times the output of the similarly sized autoregressive Gemma models.
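
A quick back-of-the-envelope calculation shows how the parameter counts relate to memory. The 26 billion and 3.8 billion figures come from the paragraph above; the bit widths are assumptions, since the precision behind the 18GB figure isn't stated. The sketch counts weight storage only and ignores activations, caches, and runtime overhead.

```python
TOTAL_PARAMS_B = 26.0   # total parameters, in billions (from the article)
ACTIVE_PARAMS_B = 3.8   # parameters activated per token (from the article)

def weight_footprint_gb(params_billion, bits_per_param):
    """Approximate weight storage in GB (10^9 bytes) at a given precision."""
    return params_billion * bits_per_param / 8

if __name__ == "__main__":
    # Bit widths below are illustrative assumptions, not published settings.
    for bits in (16, 8, 4):
        gb = weight_footprint_gb(TOTAL_PARAMS_B, bits)
        print(f"{bits:>2}-bit weights: about {gb:.1f} GB for all experts")
    share = ACTIVE_PARAMS_B / TOTAL_PARAMS_B
    print(f"Active share per token: {share:.1%}")
```

Under these assumptions, 18GB falls between the 4-bit and 8-bit footprints. The small active share reduces the compute per token, not the amount of memory the weights occupy.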

By generating up to 256 tokens in parallel, this approach to text generation shifts the bottleneck from memory bandwidth to compute. Google says this offers a measurable boost in non-linear tasks like in-line editing, molecular sequencing, and mathematical graphing. The animation above shows how DiffusionGemma was tuned to solve Sudoku puzzles, a notoriously challenging task for standard autoregressive AI models because each token depends on tokens that haven’t been generated yet. DiffusionGemma’s ability to continuously self-correct large sets of tokens makes that easier.

Multiple paths to local efficiency

If diffusion is so much faster, why isn’t Google using it in big cloud-based Gemini models? Google has experimented with this, but there are a few drawbacks to text diffusion, including a higher error rate. In image generation models, a single badly predicted pixel doesn’t make the image useless, but language is discrete. An equivalent error in text can make a block of tokens meaningless and force you to start over to get a better output. Diffusion models also waste resources when the desired output is only a few tokens long. They have to do a lot more parallel work to whittle the canvas down to, say, five tokens, which an autoregressive model would produce from beginning to end in just five steps.
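
The short-output penalty is easy to see with a rough cost model. The sketch below counts two things: how many sequential steps each approach needs, and how many token positions it evaluates in total. The 256-token canvas is taken from the article; the eight denoising passes per canvas are an assumption chosen for illustration, and the model ignores attention over the prompt, caching, and everything else that affects real latency.

```python
from dataclasses import dataclass

@dataclass
class Cost:
    sequential_steps: int     # forward passes that must run one after another
    positions_evaluated: int  # total token positions scored across all passes

def autoregressive_cost(n_tokens):
    """One forward pass per new token, each producing a single token."""
    return Cost(sequential_steps=n_tokens, positions_evaluated=n_tokens)

def diffusion_cost(n_tokens, canvas=256, passes=8):
    """Whole-canvas passes; `passes=8` is an assumed value, not a published one."""
    blocks = -(-n_tokens // canvas)  # ceiling division: canvases needed
    return Cost(
        sequential_steps=blocks * passes,
        positions_evaluated=blocks * passes * canvas,
    )

if __name__ == "__main__":
    for n in (5, 256):
        print(n, autoregressive_cost(n), diffusion_cost(n))
```

With these numbers, a five-token answer takes five sequential steps autoregressively but eight full-canvas passes with diffusion. A 256-token answer flips the comparison: 256 sequential steps versus eight.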

The efficiency gain for local processing makes this an appealing avenue of experimentation, though. In the cloud, autoregressive models can batch large numbers of compute jobs from multiple users so they’re always churning out tokens, and the high bandwidth memory (HBM) used in these systems can move data around much more efficiently. Conversely, local AI encounters wasted compute cycles due to lower memory bandwidth and idle time. Diffusion models can make more efficient use of available compute, but this isn’t the only way. Google also recently began implementing Multi-Token Prediction (MTP) drafters, which use otherwise wasted compute cycles to predict possible tokens to increase speed. But diffusion is even faster than the MTP versions of Gemma.
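
The article doesn't describe how Google's MTP drafters are wired up, so the following is only a generic draft-and-verify sketch of the idea of spending spare compute on guessed tokens. Both "models" are hypothetical toy functions over digits: `target_next` is the slow, trusted model and `draft_next` is a cheap drafter that is occasionally wrong. A real system would check all drafted positions in a single batched forward pass; here that is simulated with ordinary function calls.

```python
def target_next(seq):
    """Toy 'large model': counts upward modulo 10."""
    return (seq[-1] + 1) % 10

def draft_next(seq):
    """Toy drafter: agrees with the target except right after a 2."""
    return 9 if seq[-1] == 2 else (seq[-1] + 1) % 10

def speculative_decode(target, drafter, prompt, n_new, k=4):
    """Greedy draft-and-verify; returns (tokens, number of target passes)."""
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < n_new:
        # 1. The drafter cheaply proposes k tokens.
        draft = []
        for _ in range(k):
            draft.append(drafter(out + draft))
        # 2. One target pass verifies the whole draft (simulated sequentially).
        target_passes += 1
        accepted = []
        for tok in draft:
            expected = target(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # keep the target's correction, stop
                break
        else:
            # Every drafted token matched, so the same pass yields one more.
            accepted.append(target(out + accepted))
        out.extend(accepted)
    return out[: len(prompt) + n_new], target_passes

if __name__ == "__main__":
    tokens, passes = speculative_decode(target_next, draft_next, [0], n_new=8)
    print(tokens, "target passes:", passes)
```

The output always matches what the target model would have produced on its own. The drafter only changes how many target passes it takes to get there.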

Google stresses that DiffusionGemma is experimental, but it’s available under the same Apache 2.0 license as all the other fourth-generation Gemma models. You can download the model weights today from Hugging Face. Google says it worked with Nvidia to ensure DiffusionGemma was optimized for a variety of setups, including high-end RTX GPUs (quantized) and enterprise systems like the H100 or DGX Spark platform.
