1-Bit Bonsai Image 4B Image Generation for Local Devices

Introducing 1-bit and Ternary Bonsai Image 4B: Image Generation for Local Devices

May 26, 2026 • PrismML

Today we’re releasing Bonsai Image 4B, a family of compact image-generation models designed to run high-quality diffusion inference on local hardware, from laptops to phones.

Bonsai Image 4B comes in two variants:

1-bit Bonsai Image 4B uses binary {−1, +1} transformer weights with an FP16 group-wise scaling factor, giving 1.125 effective bits per weight. It targets maximum compression and is the right fit when memory pressure, bandwidth, and deployment footprint are the primary constraints.

Ternary Bonsai Image 4B uses {−1, 0, +1} transformer weights with an FP16 group-wise scaling factor, giving 1.71 effective bits per weight. The additional zero state gives the model more representational flexibility, improving visual quality and prompt fidelity while remaining extremely compact.
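The effective-bit figures above can be reproduced with a short calculation. This is a sketch under assumptions the post does not state: a group size of 128 weights per FP16 scale, and ternary codes counted at their ideal packing cost of log2(3) bits. Both are chosen only because they match the published 1.125 and 1.71 values; the actual storage layout may differ.

```python
import math

def effective_bits(code_bits: float, scale_bits: int = 16, group_size: int = 128) -> float:
    """Bits stored per weight: the code itself plus the amortized share of one scale per group.

    group_size=128 is an assumption, not a value given in the post.
    """
    return code_bits + scale_bits / group_size

binary_bits = effective_bits(1.0)            # one bit per {-1, +1} weight
ternary_bits = effective_bits(math.log2(3))  # three states, ideal packing (assumed)

print(f"binary:  {binary_bits:.3f} bits/weight")   # binary:  1.125 bits/weight
print(f"ternary: {ternary_bits:.2f} bits/weight")  # ternary: 1.71 bits/weight
```

Under these assumptions the FP16 scale adds 16 / 128 = 0.125 bits to every weight in both variants, so the gap between the two formats comes entirely from the per-weight code.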

The result is a new deployment regime for image generation: capable outputs, open weights, and practical local inference on devices that were previously out of reach for this class of model. To our knowledge, Bonsai Image 4B is the first image model in its parameter class to run directly on an iPhone.

Built for local generation

Local image generation starts with a hard constraint: the model has to fit within the device’s memory budget.

For a 4B-class image model, the diffusion transformer is the largest part of the model and the part that runs repeatedly during generation. Each denoising step invokes the transformer again, so transformer size directly shapes memory pressure, bandwidth demand, and local inference speed.

Bonsai Image 4B is built from FLUX.2 Klein 4B. It keeps the architecture intact but changes how the transformer weights are represented. By moving those weights into binary and ternary form, Bonsai shrinks the part of the image pipeline that matters most for local deployment.
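To make "binary and ternary form" concrete, here is a generic illustration of group-wise weight quantization. It is not PrismML's method, which the post does not describe: the binary rule (sign plus mean-absolute-value scale) and the ternary rule (zero out weights below a threshold, here 0.7 times the group's mean absolute value) are common textbook heuristics, and the tiny group of 8 weights is made up for the example.

```python
def quantize_group_binary(weights):
    """Map one group to codes in {-1, +1} plus one scale (mean absolute value)."""
    scale = sum(abs(w) for w in weights) / len(weights)
    codes = [1 if w >= 0 else -1 for w in weights]
    return codes, scale

def quantize_group_ternary(weights, threshold_ratio=0.7):
    """Map one group to codes in {-1, 0, +1}; threshold_ratio is an assumed heuristic."""
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    threshold = threshold_ratio * mean_abs
    codes = [0 if abs(w) < threshold else (1 if w > 0 else -1) for w in weights]
    kept = [abs(w) for w, c in zip(weights, codes) if c != 0]
    scale = sum(kept) / len(kept) if kept else 0.0  # average magnitude of surviving weights
    return codes, scale

def dequantize(codes, scale):
    """Reconstruct approximate weights from codes and the shared group scale."""
    return [c * scale for c in codes]

def squared_error(original, approx):
    return sum((o - a) ** 2 for o, a in zip(original, approx))

# Hypothetical group of weights, for illustration only.
group = [0.9, -0.05, -1.1, 0.02, 0.4, -0.6, 0.03, 1.0]

b_codes, b_scale = quantize_group_binary(group)
t_codes, t_scale = quantize_group_ternary(group)

print(b_codes, round(b_scale, 4))
print(t_codes, round(t_scale, 4))
print(squared_error(group, dequantize(b_codes, b_scale)))
print(squared_error(group, dequantize(t_codes, t_scale)))
```

On this toy group the ternary codes send the three near-zero weights to 0 instead of forcing them to plus or minus the scale, which is why the ternary reconstruction error comes out lower. That is the intuition behind the extra zero state, shown on made-up numbers rather than on the real model.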

The binary layers provide roughly a 14x reduction relative to full-precision transformer weights. A small set of precision-sensitive supporting tensors (~5%), called the projection layers, remains in FP16, so the final 1-bit Bonsai Image 4B transformer is 0.93 GB: an 8.3x reduction from the 7.75 GB full-precision FLUX.2 Klein 4B transformer.

The ternary variant follows the same structure. Its ternary layers provide roughly a 10x reduction, and the final Ternary Bonsai Image 4B transformer is 1.21 GB, a 6.4x reduction from the full-precision transformer. It is slightly larger than the 1-bit model, but the additional zero state improves visual quality and prompt fidelity.
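The transformer sizes above follow from mixing a small FP16 share with low-bit layers. The sketch below is a back-of-the-envelope check, assuming that "full precision" means 16 bits per weight (consistent with the stated 14x layer reduction, since 16 / 1.125 ≈ 14.2) and taking the FP16 share as exactly 5%, although the post only says "~5%". The function name and its defaults are illustrative.

```python
FULL_PRECISION_GB = 7.75  # FLUX.2 Klein 4B transformer size, from the post
FULL_PRECISION_BITS = 16  # assumption: the full-precision baseline is 16-bit

def transformer_size_gb(low_bit_bits: float, fp16_fraction: float = 0.05) -> float:
    """Estimate transformer size when fp16_fraction of weights stay in FP16
    and the remainder is stored at low_bit_bits effective bits per weight."""
    low_bit_share = (1.0 - fp16_fraction) * low_bit_bits / FULL_PRECISION_BITS
    return FULL_PRECISION_GB * (fp16_fraction + low_bit_share)

for name, bits, published_gb in [("1-bit", 1.125, 0.93), ("ternary", 1.71, 1.21)]:
    layer_reduction = FULL_PRECISION_BITS / bits
    estimate = transformer_size_gb(bits)
    print(f"{name}: layers {layer_reduction:.1f}x smaller, "
          f"estimated {estimate:.2f} GB vs published {published_gb} GB")
```

With these rounded inputs the estimates land within a few hundredths of a gigabyte of the published 0.93 GB and 1.21 GB, and the per-layer ratios come out near 14x and a little over 9x. The small gaps are expected given that the FP16 share is only quoted approximately.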

Including the compressed text encoder and FP16 VAE, the Apple Silicon deployment payload is 3.42 GB for 1-bit Bonsai Image 4B and 3.88 GB for Ternary Bonsai Image 4B. For comparison, the full-precision FLUX.2 Klein 4B requires a deployment payload of 15.97 GB.

Because the text encoder is offloaded after prompt encoding at runtime, mean memory usage is smaller than the total payload. When generating a 512x512 image, mean active memory is 1.5 GB for the binary model and 1.96 GB for the ternary model, compared to 11.74 GB for the original FLUX.2 Klein 4B (reductions of 7.8x and 6.0x, respectively).
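The reduction factors quoted for payload and active memory are simple ratios of the figures in the text. The snippet below recomputes them from those published numbers; the dictionary keys and the helper name are illustrative.

```python
# Published figures from the post, in GB.
payload_gb = {"FLUX.2 Klein 4B": 15.97, "1-bit Bonsai": 3.42, "Ternary Bonsai": 3.88}
mean_active_gb = {"FLUX.2 Klein 4B": 11.74, "1-bit Bonsai": 1.5, "Ternary Bonsai": 1.96}

def reduction(table: dict, model: str, baseline: str = "FLUX.2 Klein 4B") -> float:
    """How many times smaller `model` is than `baseline` in the given table."""
    return table[baseline] / table[model]

for model in ("1-bit Bonsai", "Ternary Bonsai"):
    print(f"{model}: payload {reduction(payload_gb, model):.1f}x smaller, "
          f"mean active memory {reduction(mean_active_gb, model):.1f}x smaller")
```

The active-memory ratios reproduce the stated 7.8x and 6.0x. The payload ratios are smaller, roughly 4.7x and 4.1x, because the payload also includes the text encoder and the FP16 VAE, which are not compressed as aggressively as the transformer.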

This reduction in memory footprint changes where the model can run. Our deployment stack supports Apple Silicon iPhones, iPads, and Macs, as well as CUDA GPUs, using MLX low-bit paths on Apple hardware and Gemlite low-bit GEMM kernels on CUDA. On iPhone 17 Pro Max, the full-precision FLUX.2 Klein 4B pipeline does not fit within the device memory budget, while both Bonsai Image variants run on-device.

In practice, Bonsai Image 4B generates a 512x512 image in 9.4 seconds on an iPhone 17 Pro Max and in about 6 seconds on a Mac M4 Pro. On the Mac M4 Pro, Bonsai Image 4B is up to 5.6x faster than the stock full-precision MFLUX pipeline.

Benchmarking performance

Compression only matters if the model remains useful. We evaluated Bonsai Image 4B across three complementary benchmarks: GenEval for object composition and attribute binding; HPSv3 for human preference and aesthetic quality; and DPG-Bench for dense prompt following and semantic faithfulness.

Ternary Bonsai Image 4B is the quality-oriented variant. At 1.21 GB, it retains 95% of the FLUX.2 Klein 4B accuracy across GenEval, HPSv3, and DPG-Bench, while reducing the diffusion transformer footprint by 6.4x.

1-bit Bonsai Image 4B is the footprint-oriented variant. It brings the diffusion transformer below 1 GB, an 8.3x reduction, while retaining 88% of the FLUX.2 Klein 4B accuracy across the same three evaluations.

Together, the two variants move the quality–footprint frontier. Bonsai Image remains competitive with modern 4B-class image models while using a fraction of their diffusion-transformer footprint.