Systematic Optimization of Real-Time Diffusion Model Inference on Apple M3 Ultra

Abstract: While real-time image generation with diffusion models has advanced rapidly on NVIDIA GPUs, systematic optimization research on non-CUDA platforms such as Apple Silicon remains scarce. In this study, we ran optimization experiments in 10 phases on the Apple M3 Ultra (60-core GPU, 512 GB unified memory), with the goal of achieving real-time camera image-to-image (img2img) transformation.

We explored a wide range of techniques, including CoreML conversion, quantization, Token Merging, Neural Engine utilization, compact model exploration, frame interpolation, kNN search-based synthesis, pix2pix-turbo, optical flow frame skipping, and knowledge distillation, and quantitatively evaluated the effectiveness of each approach. Ultimately, by combining a CoreML conversion of the distillation-based model SDXS-512 with a 3-thread camera pipeline, we achieved real-time camera img2img transformation at 22.7 FPS and 512x512 resolution.
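
To make the pipeline idea concrete, the following is a minimal sketch of a three-thread producer/consumer arrangement. It is not the implementation used in this study. It assumes the three threads handle capture, inference, and display, which the abstract does not spell out. The names `capture`, `infer`, `display`, `put_latest`, `run_pipeline`, and `frame_budget_ms` are hypothetical, and the camera and CoreML model are replaced by plain callables so that the sketch runs anywhere.

```python
import queue
import threading


def frame_budget_ms(fps):
    """Per-frame time budget in milliseconds for a target frame rate."""
    return 1000.0 / fps


def put_latest(q, item):
    """Put item into a size-1 queue, discarding a stale entry if one is waiting."""
    while True:
        try:
            q.put_nowait(item)
            return
        except queue.Full:
            try:
                q.get_nowait()  # drop the stale entry, then retry the put
            except queue.Empty:
                pass  # the consumer took it in the meantime


def run_pipeline(capture, infer, display, num_frames):
    """Run capture, inference, and display in three threads (assumed split).

    Size-1 queues let a slow stage skip stale frames instead of building latency.
    """
    frames = queue.Queue(maxsize=1)
    results = queue.Queue(maxsize=1)

    def capture_worker():
        for i in range(num_frames):
            put_latest(frames, capture(i))
        frames.put(None)  # blocking put: the last frame is never dropped

    def infer_worker():
        while True:
            frame = frames.get()
            if frame is None:
                break
            put_latest(results, infer(frame))
        results.put(None)  # blocking put: the last result is always displayed

    def display_worker():
        while True:
            out = results.get()
            if out is None:
                break
            display(out)

    workers = (capture_worker, infer_worker, display_worker)
    threads = [threading.Thread(target=w) for w in workers]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

In a real setup `capture` would read a camera frame and `infer` would call the converted model. Under this design the inference thread always works on the most recent frame, and at 22.7 FPS the budget returned by `frame_budget_ms` is roughly 44 ms per frame.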

The primary contribution of this work is a systematic demonstration that optimization insights established for CUDA do not necessarily carry over to Apple Silicon's unified memory architecture. We show that the optimization landscape differs fundamentally from that of NVIDIA GPUs: quantization yields no speedup, parallel inference is ineffective, and the Neural Engine is unsuitable for large-scale models. Based on these findings, we provide practical guidelines for diffusion model inference on Apple Silicon.
