Diffusion Language Models: An Experimental Analysis

Large Language Models (LLMs) have revolutionized language modeling through autoregressive generation, enabling strong performance across a wide range of tasks. Recently, Diffusion Language Models (DLMs) have emerged as an alternative paradigm that generates text through iterative denoising rather than next-token prediction, allowing parallel refinement of entire sequences.

While numerous diffusion-based architectures have been proposed, differences in evaluation protocols, datasets, inference budgets, and generation hyperparameters make it difficult to compare their capabilities and understand the trade-offs they offer. In this work, we present a systematic experimental analysis of modern DLMs.

Specifically, we evaluate eight state-of-the-art DLMs across eight benchmarks spanning reasoning, coding, translation, knowledge, and structured problem solving, while explicitly considering both generation quality and computational efficiency. Beyond downstream evaluation, we analyze the impact of key inference-time factors, including denoising steps, context length, block size, and parallel unmasking strategies, and complement large-scale experiments with controlled comparisons of smaller models trained under identical conditions.
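
To make the inference-time factors above concrete, the following is a minimal toy sketch of how the number of denoising steps interacts with parallel unmasking in a masked-diffusion-style decoder. It is an illustration under stated assumptions, not the procedure used by any of the evaluated models: `toy_denoiser`, `decode`, and the confidence-based top-k rule are hypothetical names and choices introduced here, and the "model" simply returns random confidence scores.

```python
import random

MASK = None  # placeholder for a position that has not been generated yet


def toy_denoiser(seq, rng):
    """Hypothetical stand-in for a DLM forward pass (assumption, not a real API).

    For every masked position, return a (token, confidence) proposal.
    A real model would return a distribution over the vocabulary.
    """
    return {
        i: (f"tok{i}", rng.random())
        for i, tok in enumerate(seq)
        if tok is MASK
    }


def decode(length, steps, seed=0):
    """Fill a fully masked sequence in at most `steps` denoising passes.

    Returns the final sequence and the number of passes actually used.
    """
    rng = random.Random(seed)
    seq = [MASK] * length
    passes = 0
    for step in range(steps):
        proposals = toy_denoiser(seq, rng)
        if not proposals:
            break  # nothing left to unmask
        passes += 1
        # Spread the remaining masked positions evenly over the remaining steps.
        remaining_steps = steps - step
        k = -(-len(proposals) // remaining_steps)  # ceiling division
        # Parallel unmasking: commit the k most confident proposals at once.
        best = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)[:k]
        for i in best:
            seq[i] = proposals[i][0]
    return seq, passes


if __name__ == "__main__":
    for steps in (1, 4, 8):
        _, passes = decode(length=8, steps=steps)
        print(f"steps={steps}: {passes} forward passes for 8 tokens")
```

In this sketch, fewer denoising steps mean more tokens are committed per forward pass, which lowers the compute budget; how that choice affects output quality in real models is the kind of trade-off the study measures.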

Our analysis highlights the strengths and limitations of diffusion-based language modeling across different tasks, architectures, and inference budgets. We show that the behavior of DLMs is strongly influenced by generation-time design choices, leading to distinct trade-offs between performance and computational efficiency. Overall, our study provides practical insights into the capabilities and deployment characteristics of contemporary DLMs.
