Fine-Tuning NVIDIA Cosmos Predict 2.5 with LoRA/DoRA for Robot Video Generation


Motivation

NVIDIA Cosmos Predict 2.5 is a large-scale world model that generates physically plausible videos conditioned on text, images, or video clips. Adapting it to a specific domain, such as robot manipulation or a particular camera viewpoint, still requires targeted fine-tuning. This matters for robot learning: training robot policies requires demonstration data, but collecting real-robot trajectories is slow and expensive. Generating synthetic trajectories with a fine-tuned video world model offers a scalable alternative.

However, full fine-tuning of a 2B-parameter model is expensive and risks catastrophic forgetting of general knowledge. LoRA and DoRA instead inject small trainable adapter modules into the frozen base model, which reduces memory requirements and keeps the adapter files small and portable. This makes it practical to fine-tune on a single GPU and to swap adapters for different domains at inference time. This guide walks through parameter-efficient fine-tuning of Cosmos Predict 2.5 with LoRA and DoRA, using the diffusers and accelerate libraries, with support for both single- and multi-GPU training. We then show how to use the fine-tuned model to generate synthetic robot trajectories for downstream robot learning tasks.

Requirements

  • Python 3.10+
  • PyTorch 2.5+ with CUDA
  • diffusers (pulls in transformers and peft automatically), accelerate
  • Optional: install wandb to monitor training
  • At least one 80 GB GPU for single-GPU training; 8× H100s recommended for faster iteration

Install dependencies on your machine:

pip install -U "diffusers[torch]" transformers accelerate peft wandb

Preparing Data

After installing diffusers, navigate to examples/cosmos in the diffusers repository to explore the example code. We use the same datasets as the GR00T Dreams post-training recipe:

  • Training Dataset: 92 robot manipulation videos with text prompts describing pick-and-place tasks.
  • Test Dataset: 50 (prompt, image) pairs. The model should generate a video based on the input text prompt and the initial frame image.

Download and preprocess the training and test datasets using download_and_preprocess_datasets.sh:

bash download_and_preprocess_datasets.sh

The resulting training dataset folder looks like this:

gr1_dataset/train
├── metas/
│   └── *.txt
├── videos/
│   └── *.mp4
└── metadata.csv

The test dataset is a flat directory of paired .txt and .png files, one pair per (prompt, image) sample:

gr1_dataset/test
├── filename1.txt
├── filename1.png
├── filename2.txt
├── filename2.png
└── ...
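
To make the test layout concrete, here is a minimal sketch of how the (prompt, image) pairs could be collected from the flat directory. The helper name load_eval_pairs is hypothetical and is not part of the example scripts; the only assumption is that every .txt file has a .png file with the same stem, as in the listing above.

```python
from pathlib import Path


def load_eval_pairs(test_dir):
    """Return a sorted list of (prompt_text, image_path) pairs.

    Hypothetical helper: assumes each `<name>.txt` prompt file sits next to
    a `<name>.png` first-frame image in the same flat directory.
    """
    test_dir = Path(test_dir)
    pairs = []
    for txt_path in sorted(test_dir.glob("*.txt")):
        png_path = txt_path.with_suffix(".png")
        if not png_path.exists():
            # Fail loudly rather than silently skipping an unpaired prompt.
            raise FileNotFoundError(f"missing image for prompt {txt_path.name}")
        prompt = txt_path.read_text(encoding="utf-8").strip()
        pairs.append((prompt, png_path))
    return pairs


# Example usage (path taken from the layout above):
# pairs = load_eval_pairs("gr1_dataset/test")
```

Raising on an unpaired prompt is a design choice of this sketch; a script could equally log and skip such files.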

Training

In this section, we walk through the implementation in train_cosmos_predict25_lora.py.

VideoDataset

VideoDataset loads each sample as a (caption, video) pair from args.train_data_dir (gr1_dataset/train in our example). For videos longer than args.num_frames, it samples a random contiguous window of args.num_frames frames each epoch, which acts as temporal augmentation. Internally, VideoProcessor from diffusers.video_processor resizes and normalizes the raw frames into a tensor of shape (channels, frames, height, width).

train_dataset = VideoDataset(
    dataset_dir=args.train_data_dir,
    num_frames=args.num_frames,
    video_size=[args.height, args.width],
)
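
The random-window sampling described above can be sketched as follows. This is an illustration only, not the code in VideoDataset: the function name sample_window is hypothetical, and raising an error for clips shorter than the window is an assumption of this sketch (the source does not say how short clips are handled).

```python
import random


def sample_window(total_frames, num_frames, rng=None):
    """Pick a random contiguous window [start, end) of `num_frames` frames.

    Hypothetical helper illustrating temporal augmentation; `rng` may be a
    `random.Random` instance for reproducibility.
    """
    if total_frames < num_frames:
        # Assumption of this sketch: short clips are rejected, not padded.
        raise ValueError("video is shorter than the requested window")
    rng = rng or random
    # randint is inclusive on both ends, so the last valid start is included.
    start = rng.randint(0, total_frames - num_frames)
    return start, start + num_frames


# Example usage: frames = video[start:end] for a decoded video sequence.
# start, end = sample_window(total_frames=120, num_frames=16)
```

Because a new start index is drawn on every call, the same video yields different windows across epochs.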

Initialize Adapter

Cosmos Predict 2.5 consists of three submodules:

  1. A VAE that encodes videos into latents
  2. A text encoder that encodes text prompts into prompt embeddings
  3. A DiT (diffusion transformer) that performs diffusion in the latent space

During training, the weights of the VAE, the text encoder, and the DiT are all frozen. LoRA adapters are injected into the DiT's attention projections (to_q, to_k, to_v, to_out.0) and feedforward layers (ff.net.0.proj, ff.net.2). The trainable LoRA parameters are then upcast to float32 for numerical stability under bf16 mixed precision.

import torch
from diffusers import Cosmos2_5_PredictBasePipeline
from diffusers.training_utils import cast_training_params
from peft import LoraConfig

# `args` holds the parsed command-line arguments of the training script.
pipe = Cosmos2_5_PredictBasePipeline.from_pretrained(
    "nvidia/Cosmos-Predict2.5-2B",
    revision="diffusers/base/post-trained",
    torch_dtype=torch.bfloat16,
)

# Freeze all base weights
dit = pipe.transformer
vae = pipe.vae
text_encoder = pipe.text_encoder
dit.requires_grad_(False)
vae.requires_grad_(False)
text_encoder.requires_grad_(False)

# Adapter configuration: attention projections and feedforward layers of the DiT
lora_config = LoraConfig(
    r=args.lora_rank,
    lora_alpha=args.lora_alpha,
    target_modules=['to_q', 'to_k', 'to_v', 'to_out.0', 'ff.net.0.proj', 'ff.net.2'],
    use_dora=args.use_dora,  # set True to switch to DoRA
)

# Only the injected adapter parameters are trainable
dit.add_adapter(lora_config)
cast_training_params(dit, dtype=torch.float32)  # LoRA params in fp32

Passing use_dora=True switches to DoRA, which decomposes each adapted weight into a magnitude and a direction: the low-rank update is applied to the direction, and the magnitude is learned separately. No other changes to the training loop are needed.
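
To illustrate the difference between the two adapters, here is a toy NumPy sketch of the effective weight of one linear layer under LoRA and under DoRA. It is not the PEFT implementation: the layer sizes are hypothetical, NumPy stands in for PyTorch, and storing one magnitude per output feature is an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
out_features, in_features, rank, alpha = 6, 8, 2, 4  # hypothetical toy sizes
scale = alpha / rank  # standard LoRA scaling factor

W0 = rng.standard_normal((out_features, in_features))  # frozen base weight
A = rng.standard_normal((rank, in_features)) * 0.01    # trainable, random init
B = np.zeros((out_features, rank))                     # trainable, zero init
# DoRA magnitude (trainable), initialized to the row norms of the base weight.
magnitude = np.linalg.norm(W0, axis=1, keepdims=True)


def lora_weight(W0, A, B, scale):
    """LoRA: add the scaled low-rank update to the frozen weight."""
    return W0 + scale * (B @ A)


def dora_weight(W0, A, B, scale, magnitude):
    """DoRA: update the direction with LoRA, then rescale by the magnitude."""
    V = W0 + scale * (B @ A)
    return magnitude * V / np.linalg.norm(V, axis=1, keepdims=True)


x = rng.standard_normal((3, in_features))
y_base = x @ W0.T
# With B = 0, both adapters leave the layer output unchanged.
y_lora_init = x @ lora_weight(W0, A, B, scale).T
y_dora_init = x @ dora_weight(W0, A, B, scale, magnitude).T

# After an update to B, DoRA keeps each row norm equal to its magnitude.
B_updated = rng.standard_normal((out_features, rank))
W_dora = dora_weight(W0, A, B_updated, scale, magnitude)

lora_params = rank * (in_features + out_features)
dora_params = lora_params + out_features  # extra magnitude vector
```

In this sketch DoRA adds only one extra scalar per output feature on top of the LoRA matrices, which is why the adapter stays small.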

Loss

Cosmos Predict 2.5 uses rectified flow: the model is trained to predict the velocity along the straight line between the original "clean" data and a noise sample. Concretely, at timestep t, a noisy interpolation x_t = σ_t·noise + (1 − σ_t)·clean is constructed at a sampled noise level σ_t, and the model learns to predict the target velocity noise − clean by minimizing a mean-squared-error (MSE) loss. The first two frames of the video are used as conditioning, so no noise is added to their latents.
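
The objective can be checked on toy data with a few lines of NumPy. This is a sketch under stated assumptions rather than the training script: the tensor sizes are illustrative, the logit-normal draw uses a standard normal passed through a sigmoid (the actual mean and scale are not given here), and the loss is a plain MSE over all positions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy latents with shape (batch, channels, frames, height, width).
clean = rng.standard_normal((2, 4, 6, 8, 8))
noise = rng.standard_normal(clean.shape)

# Logit-normal noise level: sigmoid of a normal draw, one value per sample.
sigma_t = 1.0 / (1.0 + np.exp(-rng.standard_normal((2, 1, 1, 1, 1))))

# Conditioning mask: 1 for the first two latent frames, 0 elsewhere.
cond_mask = np.zeros((1, 1, 6, 1, 1))
cond_mask[:, :, :2] = 1.0

# Noisy interpolation between the clean latent and the noise.
xt_noisy = sigma_t * noise + (1.0 - sigma_t) * clean
# Conditioning frames are kept clean.
xt = clean * cond_mask + xt_noisy * (1.0 - cond_mask)

# Regression target: the velocity pointing from clean data toward noise.
target = noise - clean


def mse(pred, target):
    """Plain mean-squared error over all elements."""
    return float(np.mean((pred - target) ** 2))


loss_perfect = mse(target, target)                    # exact velocity prediction
loss_zero_pred = mse(np.zeros_like(target), target)   # predicting zero velocity
```

A useful sanity check is that the clean latent can be recovered from the interpolation and the true velocity: xt_noisy − σ_t·(noise − clean) equals clean.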

The training loss follows the rectified flow formulation used by Cosmos Predict 2.5:

# Sample timestep with logit-normal distribution
sigma_t = sample_train_sigma_t(bsz, distribution='logitnormal', device=device)

# Rectified flow interpolates between clean latent and noise
xt = noise * sigma_t + clean_latent * (1 - sigma_t)

# Conditional generation: DiT conditions on the first two frames of the video, the timestep, and the prompt embeds
# `cond_indicator` and `cond_mask` have values = 1 for the first two frames and 0 for other frames
xt = clean_latent * cond_mask + xt * (1 - cond_mask)