Towards Speed-of-Light Text Generation with Nemotron-Labs Diffusion Language Models

Large language models (LLMs) have become the default interface for code generation, math problem solving, summarization, document understanding, and many other developer workflows. Under the hood, though, many LLMs still generate text the same way: one token at a time, with each token conditioned on the tokens that came before it. Such models are called autoregressive because they consume their own outputs.

That autoregressive (AR) approach has been remarkably successful. It is stable to train, simple to serve, and responsible for much of the progress in modern language modeling. But it also creates a hard limit: every new token requires a full forward pass of the model, and every weight has to be loaded from memory before computation can start. For developers building latency-sensitive applications, running small batch sizes, or trying to make better use of modern GPUs, token-by-token generation can leave performance on the table, because most of the GPU's time is spent on memory operations rather than computation.
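The memory limit described above can be made concrete with a back-of-the-envelope estimate. The sketch below assumes that each forward pass reads every weight from memory exactly once and ignores compute time, caches, and batching. The function names and the numbers (8B parameters, 16-bit weights, a round 1 TB/s of memory bandwidth) are illustrative assumptions, not measurements of any particular GPU or model.

```python
def min_seconds_per_pass(num_params: float, bytes_per_param: float,
                         bandwidth_bytes_per_s: float) -> float:
    """Lower bound on one forward pass when every weight is read from memory once."""
    return num_params * bytes_per_param / bandwidth_bytes_per_s


def max_tokens_per_second(seconds_per_pass: float, tokens_per_pass: float = 1.0) -> float:
    """Upper bound on decode throughput for a single sequence (batch size 1)."""
    return tokens_per_pass / seconds_per_pass


# Illustrative assumptions, not measurements:
NUM_PARAMS = 8e9        # an 8B-parameter model
BYTES_PER_PARAM = 2     # 16-bit weights
BANDWIDTH = 1e12        # 1 TB/s, a hypothetical round number

pass_time = min_seconds_per_pass(NUM_PARAMS, BYTES_PER_PARAM, BANDWIDTH)
print(f"memory-bound time per pass: {pass_time * 1e3:.1f} ms")
print(f"ceiling at 1 token per pass: {max_tokens_per_second(pass_time):.1f} tokens/s")
```

Under these assumptions the time per pass is fixed by the weight traffic, so the only way to raise the throughput ceiling for a single sequence is to produce more than one token per pass.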

Additionally, once an autoregressive model generates a token, that token is final: the model has no inherent ability to revise earlier tokens. Consequently, mistakes can propagate over the course of generation. Nemotron-Labs Diffusion introduces a new path forward: diffusion language models (DLMs), which generate multiple tokens in parallel and then iteratively refine them over multiple steps.

Not only can these models better exploit the parallel computational model of modern GPUs and offer significant runtime performance benefits, they can also revise generated tokens, which makes them better suited to revising existing text and to fill-in-the-middle objectives. This generate-and-refine property also offers a built-in way to control the inference budget: reducing the number of refinement steps reduces the compute these models require at runtime.
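To illustrate how the number of steps acts as a compute budget, here is a minimal toy sketch of block decoding. It is a simplification under stated assumptions, not the Nemotron-Labs Diffusion sampler: `toy_model` is a hypothetical stand-in for one forward pass that predicts every position at once, and each step simply commits the most confident undecided positions (it does not model revising tokens that were already committed).

```python
import math
from typing import Callable, List, Optional, Tuple

MASK = None  # marks a position that has not been committed yet
Prediction = Tuple[str, float]  # (token, confidence)

TARGET = ["the", "cat", "sat", "on", "the", "mat", ".", "<eos>"]


def toy_model(block: List[Optional[str]]) -> List[Prediction]:
    """Hypothetical forward pass: returns a (token, confidence) pair per position."""
    known = sum(tok is not MASK for tok in block)
    # Confidence rises as more of the block is committed and falls with position.
    return [(TARGET[i % len(TARGET)], 0.5 + 0.05 * known - 0.01 * i)
            for i in range(len(block))]


def diffusion_decode_block(model: Callable[[List[Optional[str]]], List[Prediction]],
                           block_len: int, num_steps: int) -> Tuple[List[str], int]:
    """Fill a block in about num_steps passes; returns (tokens, forward passes used)."""
    block: List[Optional[str]] = [MASK] * block_len
    per_step = math.ceil(block_len / num_steps)  # positions committed per pass
    passes = 0
    while any(tok is MASK for tok in block):
        preds = model(block)  # one pass scores every position in parallel
        passes += 1
        masked = [i for i, tok in enumerate(block) if tok is MASK]
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:per_step]:  # commit the most confident positions
            block[i] = preds[i][0]
    return block, passes


for steps in (8, 4, 2):
    tokens, passes = diffusion_decode_block(toy_model, block_len=8, num_steps=steps)
    print(steps, passes, " ".join(tokens))
```

In this toy the output is the same for every step count because the stand-in model is always right; with a real model, fewer steps trade some quality for fewer forward passes.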

Three Generation Modes in One Model

Nemotron-Labs Diffusion is designed around a simple idea: autoregressive and diffusion generation should not be separate model families. They should be capabilities of the same model. The model supports three generation modes:

  • Autoregressive mode runs like a standard left-to-right LLM. This keeps compatibility with the generation workflow developers already know.
  • Diffusion mode generates text block by block, producing the tokens of each block in parallel and refining them over multiple steps.
  • Self-speculation mode uses diffusion to draft multiple candidate tokens, then uses autoregressive decoding to verify them. This combines the speed potential of diffusion-style drafting with the reliability of AR verification.
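The draft-then-verify idea behind self-speculation mode can be sketched with toy stand-ins. Everything named below is hypothetical: `ar_next_token` plays the role of greedy autoregressive decoding, `draft_block` plays the role of a diffusion drafter that is sometimes wrong, and verification is written as a per-token loop for readability, whereas a real system would typically score the whole draft in one batched pass. The sketch does not model the linear or quadratic variants.

```python
from typing import List, Tuple

REFERENCE = list("speculative")  # what greedy AR decoding would produce (toy)


def ar_next_token(context: List[str]) -> str:
    """Hypothetical AR verifier: the greedy next token for this context."""
    return REFERENCE[len(context)]


def draft_block(context: List[str], k: int) -> List[str]:
    """Hypothetical drafter: proposes k tokens, wrong at every fifth position."""
    start = len(context)
    return ["?" if pos % 5 == 4 else REFERENCE[pos]
            for pos in range(start, min(start + k, len(REFERENCE)))]


def verify_draft(context: List[str], draft: List[str]) -> List[str]:
    """Keep the longest draft prefix the verifier agrees with.

    At the first mismatch, the verifier's own token is kept instead,
    so every round makes progress by at least one token.
    """
    accepted: List[str] = []
    for tok in draft:
        expected = ar_next_token(context + accepted)
        if tok != expected:
            accepted.append(expected)
            break
        accepted.append(tok)
    return accepted


def self_speculate(total_len: int, k: int) -> Tuple[List[str], int]:
    """Generate total_len tokens; returns (tokens, draft-and-verify rounds)."""
    out: List[str] = []
    rounds = 0
    while len(out) < total_len:
        draft = draft_block(out, min(k, total_len - len(out)))
        out += verify_draft(out, draft)
        rounds += 1
    return out, rounds


tokens, rounds = self_speculate(total_len=len(REFERENCE), k=4)
print("".join(tokens), rounds)
```

The property worth noticing is that the final text matches what the AR verifier alone would have produced, while the number of rounds is smaller than the number of tokens whenever the drafter is right more than occasionally.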

This flexible design is the key developer-facing feature for applications where speed and accuracy both matter, including workloads with unpredictable batch sizes and those that serve a single query at a time (batch size = 1). Selecting the inference mode requires almost no change at the application level, since it is a deployment-time setting. Developers can therefore switch seamlessly between the model they use today and Nemotron-Labs Diffusion in any of its inference modes for ultra-fast generation.

Performance Highlights

Nemotron-Labs Diffusion 8B improves average accuracy by 1.2% over Qwen3 8B. Inference speed is compared in tokens per forward pass (TPF), a hardware-agnostic measure of token decoding efficiency. On this metric, diffusion mode reaches 2.6× higher TPF than AR models, and self-speculation pushes that further, to 6× for linear self-speculation and 6.4× for quadratic self-speculation, with comparable accuracy across the evaluated tasks.
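To see what these TPF figures mean in practice, the short sketch below converts them into a forward-pass count for a response of fixed length. It assumes an AR baseline of exactly 1 token per pass and treats the reported multipliers as average TPF values. The 1,000-token response length and the helper name `forward_passes` are illustrative choices, and the result says nothing about wall-clock time, since passes in different modes need not cost the same.

```python
import math


def forward_passes(num_tokens: int, tokens_per_pass: float) -> int:
    """Forward passes needed to emit num_tokens at a given average TPF."""
    return math.ceil(num_tokens / tokens_per_pass)


# TPF values taken from the figures above, with AR assumed to be 1.0.
TPF_BY_MODE = {
    "autoregressive": 1.0,
    "diffusion": 2.6,
    "linear self-speculation": 6.0,
    "quadratic self-speculation": 6.4,
}

RESPONSE_TOKENS = 1000  # illustrative response length

for mode, tpf in TPF_BY_MODE.items():
    print(f"{mode}: {forward_passes(RESPONSE_TOKENS, tpf)} passes")
```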