One of the First Public HiDream-O1-Image LoRAs — and How to Train Your Own

TL;DR: HiDream-O1-Image is one of the strongest open-weight text-to-image models available right now (it debuted around #8 in the Artificial Analysis T2I Arena). But it shipped inference-only, and its architecture is radically different from SDXL/Flux: no VAE, no separate text encoder, everything in one unified transformer. As a result, the usual LoRA trainers don’t work on it out of the box. This post documents one of the first public LoRA training runs for HiDream-O1-Image and one of the first general-purpose visual-enhancement LoRAs for it. I’ll show why the standard trainers (kohya, ai-toolkit, SimpleTuner) don’t fit, how I reverse-engineered a working training loop from the inference code alone, and the ~150-line trainer that produces a clean aesthetic LoRA, plus the gotchas that cost me a night.

What this LoRA is: a general-purpose anime / semi-real visual enhancement LoRA — it improves rendering quality, lighting, and stylization across diverse subjects with a trigger phrase. It’s not a character LoRA, not a single-style LoRA, and not a model-distillation artifact.

The short version of the recipe: The model’s output head predicts the clean image x0 (in patch space, [-1,1]). Build the noised input as z_t = (1 - σ)·x0 + σ·(8.0·ε) and feed the model timestep 1 - σ. Loss is just MSE(x_pred, x0) on the image-token positions. LoRA attaches via plain PEFT to the language-model decoder linears, because the backbone is a stock HF Qwen3-VL.

Prior art (what existed before this)

To set expectations honestly: I’m not claiming “world’s first LoRA file for O1.” Here is what I found before publishing:

- Kijai published a ComfyUI workflow for HiDream-O1 that includes a distill LoRA: it extracts the Dev-2604 model’s behavior as a LoRA applied to the Base model. That’s a model-compression technique, not a visual-style LoRA trained on external images.
- Ostris (author of AI Toolkit) has run initial LoRA training tests on HiDream-O1, and ai-toolkit lists O1 as a supported model. No LoRA from those tests has been publicly released as of this writing.
- TechnoEdge (Japanese tech media) reported using a face LoRA with HiDream-O1 Dev, though it’s unclear whether that LoRA was purpose-trained for O1 or adapted from elsewhere.

What I didn’t find: a publicly released, general-purpose anime / semi-real visual-enhancement LoRA trained specifically for HiDream-O1-Image. If you know of one, I’d genuinely love to see it; the more the merrier. But as of publication, this appears to be one of the first, and, as far as I can tell, the first with before/after documentation and a full open training recipe.

Why no trainer exists: the architecture

Most LoRA trainers assume the SDXL/Flux shape: a UNet/DiT denoiser + a VAE + one or two text encoders, all separate modules wired together by diffusers. You patch LoRA into the UNet/DiT attention, freeze the rest, and the trainer knows how to encode images to latents and text to embeddings.

HiDream-O1-Image is a Pixel-level Unified Transformer (UiT). From its own description: a natively unified image generative foundation model built on a Pixel-level Unified Transformer without external VAEs or disjoint text encoders, which natively encodes raw pixels, text, and task-specific conditions in a single shared token space.

Concretely (reading models/qwen3_vl_transformers.py):

- The backbone is a Qwen3VLForConditionalGeneration, a stock Hugging Face Qwen3-VL multimodal transformer.
- There is no VAE. Images are patchified directly: PATCH_SIZE = 32, so an H×W image becomes (H/32)·(W/32) tokens, each a 3·32·32 = 3072-dim vector of raw pixels.
- A small x_embedder projects the noised patch tokens into the hidden space; a final_layer2 head projects hidden states back to patch space; a t_embedder injects the timestep at a dedicated <|tms_token|> position.
- It’s trained with flow matching (fm_solvers_unipc.py), and image tokens get full (bidirectional) attention while text tokens stay causal (this is what token_types controls).

So the SDXL/Flux-style code paths in kohya, ai-toolkit, and SimpleTuner have nothing to hook: there’s no UNet, no VAE, and no separate text encoder. That’s exactly why there are no training write-ups yet: it’s a new architecture, released inference-only.
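The patch arithmetic is easy to sanity-check in a few lines. This is only a sketch: `patch_grid` is a hypothetical helper of mine, not a function from the HiDream repo, and it assumes both image sides are exact multiples of 32 (the real pipeline may resize or pad differently).

```python
from typing import Tuple

PATCH_SIZE = 32  # value quoted from models/qwen3_vl_transformers.py
CHANNELS = 3     # RGB


def patch_grid(height: int, width: int, patch: int = PATCH_SIZE) -> Tuple[int, int]:
    """Return (num_image_tokens, token_dim) for an H x W RGB image.

    Hypothetical helper for illustration only; assumes H and W divide
    evenly by the patch size.
    """
    if height % patch or width % patch:
        raise ValueError("height and width must be multiples of the patch size")
    num_tokens = (height // patch) * (width // patch)
    token_dim = CHANNELS * patch * patch
    return num_tokens, token_dim


if __name__ == "__main__":
    print(patch_grid(1024, 1024))  # 32 x 32 grid of patches
    print(patch_grid(2048, 1024))  # 64 x 32 grid of patches
```

A 1024×1024 image therefore costs 1024 image tokens of 3072 raw pixel values each, on top of the prompt tokens, which is worth keeping in mind when budgeting sequence length and VRAM.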

The good news: because the backbone is a plain transformers model, the LoRA adapter mechanics are trivial — PEFT injects into the nn.Linears natively. The hard part is the training loop, which the repo doesn’t ship. So let’s derive it.
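For readers who haven’t looked inside a LoRA adapter, here is roughly what “injecting into a linear layer” means. This is a NumPy sketch of the general LoRA idea, not PEFT’s actual code; the class name, rank, alpha, and init scale are all illustrative assumptions.

```python
import numpy as np


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A.

    Illustrative sketch of the LoRA idea, not PEFT's implementation.
    """

    def __init__(self, weight: np.ndarray, rank: int = 8, alpha: float = 16.0, seed: int = 0):
        rng = np.random.default_rng(seed)
        out_features, in_features = weight.shape
        self.weight = weight                                   # frozen, shape (out, in)
        self.A = rng.normal(0.0, 0.01, (rank, in_features))    # trainable, small random init
        self.B = np.zeros((out_features, rank))                # trainable, zero init
        self.scale = alpha / rank

    def __call__(self, x: np.ndarray) -> np.ndarray:
        # Base path plus low-rank path; with B == 0 this equals the base layer.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def merged_weight(self) -> np.ndarray:
        # Folding the adapter into W gives a single dense matrix for inference.
        return self.weight + self.scale * self.B @ self.A

    def num_trainable(self) -> int:
        return self.A.size + self.B.size
```

Because B starts at zero, a freshly wrapped layer reproduces the base model exactly, and the adapter adds only r·(in + out) trainable parameters per wrapped layer.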

Reverse-engineering the training forward from inference

The inference loop (models/pipeline.py:generate_image) tells you everything. Per denoising step it does roughly:

    sigma = step_t / 1000.0      # noise level, in (0, 1]
    t_pixeldit = 1.0 - sigma     # what the model receives as "timestep"
    x_pred = model(…, vinputs=z, timestep=t_pixeldit).x_pred
    v = (x_pred - z) / sigma
    # … and -v is fed to the FM scheduler
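To see how those per-step quantities fit together, here is a toy sampler. It swaps the repo’s UniPC flow-matching scheduler for a plain Euler step, so it is a simplification and not the pipeline’s actual update rule; `euler_sample` and `predict_x0` are hypothetical names, and the linear sigma schedule is an assumption.

```python
import numpy as np

NOISE_SCALE = 8.0  # the pipeline initialises z as NOISE_SCALE * randn


def euler_sample(predict_x0, shape, num_steps: int = 20, seed: int = 0) -> np.ndarray:
    """Integrate from sigma = 1 (pure noise) to sigma = 0 with Euler steps.

    predict_x0(z, t) stands in for the model: it takes the noised tokens and
    the "timestep" t = 1 - sigma and returns a prediction of the clean image.
    """
    rng = np.random.default_rng(seed)
    z = NOISE_SCALE * rng.standard_normal(shape)
    sigmas = np.linspace(1.0, 0.0, num_steps + 1)  # assumed linear schedule
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        t_pixeldit = 1.0 - sigma                 # what the model receives
        x_pred = predict_x0(z, t_pixeldit)       # x0-style prediction
        v = (x_pred - z) / sigma                 # points from z toward x_pred
        z = z + (sigma - sigma_next) * v         # Euler step to the next noise level
    return z


if __name__ == "__main__":
    target = np.linspace(-1.0, 1.0, 12).reshape(3, 4)
    # An "oracle" that always returns the true clean image.
    result = euler_sample(lambda z, t: target, target.shape, num_steps=10)
    print(np.abs(result - target).max())
```

With an oracle that always returns the true clean image, every step lands exactly on the straight line between the noise and the image, so the loop ends at the image itself up to floating-point error. That is the property the training target has to preserve.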

Two facts fall out of this:

1. x_pred is the model’s prediction of the clean image x0. Work the algebra backwards: if z_t = (1-σ)·x0 + σ·ε and the prediction is perfect (x_pred = x0), then (x_pred - z_t)/σ = (σ·x0 - σ·ε)/σ = x0 - ε = -(ε - x0), and ε - x0 is exactly the rectified-flow velocity the FlowMatch scheduler expects. That matches the -v the pipeline hands to the scheduler, so the head is x0-parameterized.

2. The noise scale isn’t 1. Inference initializes z = NOISE_SCALE · randn with NOISE_SCALE = 8.0, while x0 lives in [-1, 1]. So the interpolation the model was most likely trained on is z_t = (1-σ)·x0 + σ·(8.0·ε).
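If you’d rather check the algebra numerically than trust my sign-juggling, the following sketch does it with the 8.0 noise scale included. The function names are mine, and the only facts taken from the repo are NOISE_SCALE = 8.0 and the v = (x_pred - z) / sigma expression.

```python
import numpy as np

NOISE_SCALE = 8.0  # from the inference pipeline


def noised_input(x0: np.ndarray, eps: np.ndarray, sigma: float) -> np.ndarray:
    """Linear interpolation between the clean image and scaled Gaussian noise."""
    return (1.0 - sigma) * x0 + sigma * (NOISE_SCALE * eps)


def velocity_from_x_pred(x_pred: np.ndarray, z_t: np.ndarray, sigma: float) -> np.ndarray:
    """The quantity the inference loop computes before negating it."""
    return (x_pred - z_t) / sigma


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x0 = rng.uniform(-1.0, 1.0, size=(4, 8))
    eps = rng.standard_normal((4, 8))
    for sigma in (0.1, 0.5, 1.0):
        z_t = noised_input(x0, eps, sigma)
        v = velocity_from_x_pred(x0, z_t, sigma)  # perfect prediction: x_pred = x0
        # -v should equal (scaled noise - clean image) at every noise level.
        print(sigma, np.allclose(-v, NOISE_SCALE * eps - x0))
```

With a perfect prediction, -v comes out as 8.0·ε - x0 regardless of σ, which is the constant rectified-flow velocity between the clean image and the scaled noise.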

That gives the entire training step:

    import random
    import torch
    import torch.nn.functional as F

    # gen, ids, pos, tt, vinput_mask, x0 and T_EPS are defined elsewhere in the trainer
    sigma = random.uniform(T_EPS, 1.0)
    eps = torch.randn_like(x0)
    z_t = (1.0 - sigma) * x0 + sigma * (NOISE_SCALE * eps)  # NOISE_SCALE = 8.0
    t = torch.tensor([1.0 - sigma])
    out = gen(input_ids=ids, position_ids=pos, vinputs=z_t, timestep=t, token_types=tt)
    x_pred = out.x_pred[0, vinput_mask[0]]  # image-token positions only
    loss = F.mse_loss(x_pred.float(), x0[0].float())
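The “image-token positions only” part is the bit that’s easy to get wrong, so here is a toy version in NumPy. The sequence length, the mask layout, and the function name are illustrative assumptions; the real tensors come from the model forward and the pipeline’s sample builder.

```python
import numpy as np


def masked_x0_mse(x_pred_seq: np.ndarray, image_mask: np.ndarray, x0_tokens: np.ndarray) -> float:
    """MSE between predictions at image-token positions and the clean patches.

    x_pred_seq: (seq_len, token_dim) head output for every sequence position
    image_mask: (seq_len,) boolean, True where the position holds an image token
    x0_tokens:  (num_image_tokens, token_dim) clean patch tokens
    """
    picked = x_pred_seq[image_mask]  # drop prompt and special-token positions
    if picked.shape != x0_tokens.shape:
        raise ValueError("mask selects a different number of tokens than x0 provides")
    return float(np.mean((picked - x0_tokens) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy layout: 3 non-image positions followed by 4 image tokens of dim 6.
    mask = np.array([False, False, False, True, True, True, True])
    x0 = rng.uniform(-1.0, 1.0, size=(4, 6))
    pred = rng.standard_normal((7, 6))  # arbitrary values at text positions
    pred[mask] = x0 + 0.5               # constant error of 0.5 on image tokens
    print(masked_x0_mse(pred, mask, x0))
```

Whatever the head emits at prompt and special-token positions never reaches the loss; only the image-token rows are compared against x0.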

x0 is just the image, normalized to [-1,1] and patchified with the same einops rearrange the pipeline uses for reference images. The token layout (prompt → <|boi_token|> → <|tms_token|> → image tokens) is built by reusing the pipeline’s own build_t2i_text_sample, so positions and token_types line up with what the forward expects. Uniform σ sampling and…
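For intuition, here is what normalizing and patchifying looks like in plain NumPy. Treat the details as assumptions: the channel-first layout, the within-patch ordering (channel, row, column), and the /127.5 - 1 normalization are my choices for the sketch, and the real trainer should reuse the pipeline’s own einops pattern so the ordering matches what the model saw.

```python
import numpy as np

PATCH_SIZE = 32


def to_unit_range(img_uint8: np.ndarray) -> np.ndarray:
    """Map uint8 pixels in [0, 255] to float32 in [-1, 1]."""
    return img_uint8.astype(np.float32) / 127.5 - 1.0


def patchify(img: np.ndarray, patch: int = PATCH_SIZE) -> np.ndarray:
    """(C, H, W) -> (H/p * W/p, C*p*p); patches are ordered row-major over the grid."""
    c, h, w = img.shape
    gh, gw = h // patch, w // patch
    x = img.reshape(c, gh, patch, gw, patch)
    x = x.transpose(1, 3, 0, 2, 4)  # -> (gh, gw, c, p, p)
    return x.reshape(gh * gw, c * patch * patch)


def unpatchify(tokens: np.ndarray, h: int, w: int, patch: int = PATCH_SIZE, channels: int = 3) -> np.ndarray:
    """Inverse of patchify: (H/p * W/p, C*p*p) -> (C, H, W)."""
    gh, gw = h // patch, w // patch
    x = tokens.reshape(gh, gw, channels, patch, patch)
    x = x.transpose(2, 0, 3, 1, 4)  # -> (c, gh, p, gw, p)
    return x.reshape(channels, h, w)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img8 = rng.integers(0, 256, size=(3, 64, 96), dtype=np.uint8)
    x0 = patchify(to_unit_range(img8))
    print(x0.shape)  # 2 x 3 grid of patches, 3072 values each
```

The round trip through unpatchify is lossless, which makes it a handy check that a saved prediction is being reassembled in the same order it was split.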
