Discrete Diffusion Language Models for Interactive Radiology Report Drafting

Diffusion language models, which generate text by denoising a token canvas bidirectionally instead of emitting tokens left to right, have become competitive with autoregressive (AR) generation. Medical foundation models, however, remain almost entirely autoregressive.

We adapt a mixture-of-experts diffusion language model, DiffusionGemma-26B, and benchmark it against its same-size AR sibling, Gemma-4-26B, under an identical LoRA recipe on medical visual question answering datasets, with answers scored by a verbosity-robust LLM judge. The diffusion model matches or exceeds the AR model on every dataset, and the finetuned model (3.8B active parameters) is competitive with frontier vision-language models; its decoding is also 3.5-4.4x faster.

Beyond matching AR on accuracy, the diffusion model offers a drafting capability that AR does not natively provide: any-order infill. Because the canvas is denoised bidirectionally, a radiologist can fix fragments of a report and have the model fill in the text between them. This operation is inherent to diffusion, whereas autoregressive models handle it poorly. It suits real reports, which are often terse or inconsistent across clinicians and institutions.
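
The infill operation described above can be illustrated with a toy sketch. It is not the DiffusionGemma decoding procedure: `build_canvas`, `infill`, and `toy_propose` are hypothetical names, the confidence-ordered unmasking schedule is an assumption chosen for illustration, and `toy_propose` is a stand-in for a real denoiser that only emits placeholder tokens. The point is that fixed fragments stay frozen while masked gaps are filled using context on both sides.

```python
from typing import Callable, Dict, List, Tuple

MASK = "<mask>"  # placeholder for a not-yet-denoised canvas position

Proposal = Dict[int, Tuple[str, float]]  # position -> (token, confidence)


def build_canvas(fragments: List[List[str]], gap_len: int) -> Tuple[List[str], List[bool]]:
    """Lay out clinician-fixed fragments with masked gaps between them."""
    canvas: List[str] = []
    frozen: List[bool] = []
    for i, fragment in enumerate(fragments):
        if i > 0:
            canvas.extend([MASK] * gap_len)
            frozen.extend([False] * gap_len)
        canvas.extend(fragment)
        frozen.extend([True] * len(fragment))
    return canvas, frozen


def toy_propose(canvas: List[str]) -> Proposal:
    """Hypothetical stand-in for a denoiser: it sees the whole canvas and is
    more confident about masked positions that touch known text on either side."""
    out: Proposal = {}
    for pos, tok in enumerate(canvas):
        if tok != MASK:
            continue
        left_known = pos > 0 and canvas[pos - 1] != MASK
        right_known = pos + 1 < len(canvas) and canvas[pos + 1] != MASK
        out[pos] = (f"w{pos}", 0.5 * left_known + 0.5 * right_known)
    return out


def infill(canvas: List[str], frozen: List[bool],
           propose: Callable[[List[str]], Proposal],
           tokens_per_step: int = 2) -> List[str]:
    """Fill masked positions a few at a time, most confident first.
    Frozen positions are never overwritten."""
    canvas = list(canvas)
    while MASK in canvas:
        candidates = [(conf, pos, tok)
                      for pos, (tok, conf) in propose(canvas).items()
                      if canvas[pos] == MASK and not frozen[pos]]
        if not candidates:
            raise RuntimeError("denoiser proposed nothing for the remaining masks")
        candidates.sort(key=lambda c: (-c[0], c[1]))
        for _, pos, tok in candidates[:tokens_per_step]:
            canvas[pos] = tok
    return canvas


fragments = [["No", "pleural", "effusion", "."], ["Impression", ":", "normal", "."]]
canvas, frozen = build_canvas(fragments, gap_len=3)
report = infill(canvas, frozen, toy_propose)
print(" ".join(report))
```

In this toy run the two gap positions adjacent to fixed text are filled first and the middle position last, an order that a strictly left-to-right decoder cannot follow.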
