Efficient On-Device Diffusion LLM Inference with Mobile NPU

Diffusion large language models (dLLMs) accelerate generation by denoising multiple tokens in parallel, making them attractive for latency-sensitive mobile inference. However, the repeated denoising steps impose a substantial computational load on smartphones.

Mobile neural processing units (NPUs) offer high-throughput dense matrix computation, but exploiting them efficiently remains challenging: token commitment shrinks the effective workload per block, token revision complicates KV cache reuse, and the limited NPU-visible address space incurs costly remapping and data transfer overheads.

In this paper, we propose this http URL, the first NPU-aware inference framework for accelerating dLLMs on smartphones. this http URL aligns block-wise dLLM inference with the execution characteristics of mobile NPUs through three techniques.

(1) Multi-Block Speculative Decoding fills the shrinking workload in late-stage current-block decoding with speculative future-block tokens.

(2) Dual-Path Progressive Revision keeps committed tokens revisable until they are stable, and refreshes unstable tokens through a CPU-side path without stalling dense NPU execution.

(3) Swap-Optimized Memory Runtime compacts NPU-visible address layouts and overlaps data staging with NPU computation to reduce remapping and transfer overheads.
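To make the intuition behind technique (1) concrete, the following is a minimal sketch of how a fixed-size batch might be topped up with future-block positions once the current block has few masked tokens left. It is a toy illustration under stated assumptions, not the framework's actual scheduler: the function name `fill_npu_batch`, the `capacity` parameter (standing in for the number of token slots a dense NPU step processes), and the list-of-positions representation are all hypothetical, and the verification of speculative tokens is omitted.

```python
from typing import List


def fill_npu_batch(masked_by_block: List[List[int]], current: int, capacity: int) -> List[int]:
    """Choose up to `capacity` token positions for one denoising step.

    Hypothetical helper: `masked_by_block[i]` lists the positions in block i
    that are still masked (not yet committed). Positions from the current
    block are taken first; spare slots are filled with masked positions from
    later blocks, which are treated as speculative work.
    """
    # Real work: tokens the current block still needs to denoise.
    batch = list(masked_by_block[current][:capacity])

    # Speculative work: fill the remaining slots from future blocks, in order.
    for future in masked_by_block[current + 1:]:
        spare = capacity - len(batch)
        if spare <= 0:
            break
        batch.extend(future[:spare])
    return batch


# Late in decoding, block 0 has only two masked positions left,
# so two of the four slots would otherwise be wasted.
blocks = [[2, 5], [8, 9, 10, 11], [12, 13, 14, 15]]
step_positions = fill_npu_batch(blocks, current=0, capacity=4)
```

In this sketch the batch size stays constant across steps, which is the property that keeps a dense matrix engine busy; how speculative tokens are accepted or discarded is a separate question that the sketch does not address.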

We implement this http URL as an end-to-end framework and evaluate it across diverse hardware platforms and dLLM workloads. this http URL reduces LLaDA-8B generation latency by 17x-42x over a CPU baseline with prefix KV cache reuse, while preserving generation quality.