Efficient On-Device Diffusion LLM Inference with Mobile NPU
Diffusion large language models (dLLMs) accelerate generation by denoising multiple tokens in parallel, making them attractive for latency-sensitive mobile inference. However, repeated denoising introduces substantial computation on smartphones.
Mobile neural processing units (NPUs) offer high-throughput dense matrix computation, but efficiently exploiting them remains challenging: token commitment shrinks per-block effective workloads, token revision complicates KV cache reuse, and limited NPU-visible address space incurs costly remapping and data transfer overheads.
In this paper, we propose this http URL, the first NPU-aware inference framework for accelerating dLLMs on smartphones. this http URL aligns block-wise dLLM inference with the execution characteristics of mobile NPUs through three techniques.
(1) Multi-Block Speculative Decoding fills the shrinking workload in late-stage current-block decoding with speculative future-block tokens.
(2) Dual-Path Progressive Revision keeps committed tokens revisable until stable and refreshes unstable tokens through a CPU-side path without stalling dense NPU execution.
(3) Swap-Optimized Memory Runtime compacts NPU-visible address layouts and overlaps data staging with NPU computation to reduce remapping and transfer overheads.
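As a rough illustration of the idea behind technique (1), the sketch below packs a fixed-width batch for one denoising call: still-masked positions of the current block are taken first, and any leftover slots are filled with masked positions from later blocks as speculative work. This is not the paper's implementation. The `MASK` sentinel, the `pack_npu_batch` function, the block layout, and the fixed batch `width` are all hypothetical names and assumptions introduced here for illustration only.

```python
from typing import List, Tuple

MASK = -1  # hypothetical sentinel for a position that is still masked


def pack_npu_batch(
    blocks: List[List[int]], current: int, width: int
) -> List[Tuple[int, int]]:
    """Choose up to `width` masked positions for one fixed-shape call.

    Masked positions in the current block are selected first. Remaining
    slots are filled, in order, with masked positions from later blocks,
    which are treated as speculative work.
    Returns a list of (block_index, position) pairs.
    """
    slots: List[Tuple[int, int]] = []
    for b in range(current, len(blocks)):
        for pos, tok in enumerate(blocks[b]):
            if tok != MASK:
                continue  # already committed, nothing to denoise
            if len(slots) == width:
                return slots  # batch is full
            slots.append((b, pos))
    return slots


# Late-stage decoding: the current block has a single masked token left.
blocks = [
    [11, 12, MASK, 14],        # current block (index 0)
    [MASK, MASK, MASK, MASK],  # future block
    [MASK, MASK, MASK, MASK],  # future block
]
batch = pack_npu_batch(blocks, current=0, width=4)
speculative = [slot for slot in batch if slot[0] != 0]
```

In this toy setting, a batch of width 4 would carry only one useful token without the fill step; with it, the three remaining slots hold speculative positions from the next block. How the actual system selects, verifies, or discards speculative tokens is not described in this abstract.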
We implement this http URL as an end-to-end framework and evaluate it across diverse hardware platforms and dLLM workloads. this http URL reduces LLaDA-8B generation latency by 17x to 42x over the CPU baseline with prefix KV cache reuse, while preserving generation quality.