Welcome NVIDIA Cosmos 3: The First Open Omni-model for Physical AI Reasoning and Action
NVIDIA Cosmos 3 is here, and it’s available on Hugging Face today. Cosmos 3 represents a major leap forward in world foundation models (WFMs) for physical AI: a single omni-model that unifies world generation, physical reasoning, and action generation. There is no more juggling separate models and inference pipelines; Cosmos 3 covers all three. Whether you’re building for robotics, autonomous vehicles, or smart spaces, Cosmos 3 gives you the foundation to simulate and understand the physical world.
Here’s what’s shipping with this release:
- Cosmos 3 Super and Cosmos 3 Nano on Hugging Face with model cards and licensing
- Cosmos 3 Diffusers integration for generation pipelines
- Post-training scripts for training Cosmos 3 on your own data (on GitHub)
- Open synthetic data generation (SDG) datasets for physical AI
SECTION 1: What’s new with Cosmos 3?
The biggest change in Cosmos 3 compared to previous Cosmos releases is that it’s an omni-model, built on a Mixture-of-Transformers (MoT) architecture. Previously, developers had to work with separate models for different capabilities: world generation (Cosmos Predict), controlled generation (Cosmos Transfer), scene understanding (Cosmos Reason), and policy generation (Cosmos Policy). Cosmos 3 brings all of these into a single model that can reason over and generate different modalities in one unified forward pass.
This means you can now do all this from one model:
- Generate realistic and physically plausible video worlds from text, images, videos or action inputs
- Reason about physical properties like motion, causality, and spatial relationships
- Predict future video and action sequences based on the current state
Why this matters for physical AI
Cosmos 3 helps build physical AI systems capable of understanding the real world: not just pixels and tokens, but motion, causality, physics, and action. If you’re training a robot to fold laundry, building an autonomous driving simulation, or generating synthetic training data for warehouse safety scenarios, Cosmos 3 is the foundation model designed for exactly these use cases.
Architecture
Cosmos 3 is built on an MoT backbone that processes all modalities - text, image, video, audio, and action - within a single unified architecture. Each modality is first encoded by a dedicated encoder (a ViT for visual understanding, a VAE for visual/audio generation, and domain-aware vectors for actions), then projected into a shared representation space. The input sequence is split into two subsequences: an autoregressive (AR) subsequence that handles reasoning and understanding via next-token prediction, and a diffusion (DM) subsequence that handles generation via iterative denoising.
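The AR/DM split can be pictured with a toy example. This is a minimal sketch for illustration only: the `Token` type, its `role` field, and `split_subsequences` are hypothetical names rather than part of any Cosmos 3 API, the payloads are placeholder strings instead of real embeddings, and the assignment of each modality to a subsequence is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Token:
    """Hypothetical stand-in for one embedding projected into the shared space."""
    modality: str  # "text", "image", "video", "audio", or "action"
    role: str      # "understand" -> AR subsequence, "generate" -> DM subsequence
    payload: str   # placeholder for the actual embedding vector


def split_subsequences(tokens: List[Token]) -> Tuple[List[Token], List[Token]]:
    """Partition one input sequence into AR and DM subsequences, keeping order."""
    ar = [t for t in tokens if t.role == "understand"]  # next-token prediction
    dm = [t for t in tokens if t.role == "generate"]    # iterative denoising
    return ar, dm


# Assumed example input: a text prompt and an image to reason over,
# plus noisy video latents and action vectors to be denoised.
sequence = [
    Token("text", "understand", "pick up the cup"),
    Token("image", "understand", "<ViT features>"),
    Token("video", "generate", "<noisy VAE latents>"),
    Token("action", "generate", "<noisy action vectors>"),
]

ar_tokens, dm_tokens = split_subsequences(sequence)
print([t.modality for t in ar_tokens])
print([t.modality for t in dm_tokens])
```

In this sketch the two subsequences are just filtered views of one ordered input. It shows only the partitioning step, not how the backbone runs next-token prediction or denoising over each part.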
Model Versions
This release of Cosmos 3 includes two model sizes, optimized for different deployment scenarios:
- Cosmos 3 Nano - A 16B-parameter model (8B reasoner and 8B generator), optimized for efficient inference. Cosmos 3 Nano is designed to run on workstation-grade compute like the RTX PRO 6000 GPU, and is available on Hugging Face at nvidia/Cosmos3-Nano.
- Cosmos 3 Super - A 64B-parameter model (32B reasoner and 32B generator), designed for large-scale synthetic data generation (SDG) and research. It runs on NVIDIA Hopper and Blackwell GPUs, and is available on Hugging Face at nvidia/Cosmos3-Super.
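To get a rough sense of what these sizes mean for deployment, the parameter counts can be turned into a back-of-the-envelope estimate of weight storage. The sketch below assumes 16-bit weights (2 bytes per parameter), which is an assumption and not a precision stated for this release. It also ignores activations, encoders, and any other runtime memory, so treat the numbers as a lower bound. The helper name `weight_memory_gb` is made up for this example.

```python
def weight_memory_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate weight storage in GB (10**9 bytes).

    params_billion * 1e9 parameters * bytes_per_param / 1e9 bytes per GB
    simplifies to params_billion * bytes_per_param.
    """
    return params_billion * bytes_per_param


# Parameter counts from the release notes: reasoner + generator, in billions.
models = {
    "Cosmos 3 Nano": 8 + 8,
    "Cosmos 3 Super": 32 + 32,
}

for name, params_b in models.items():
    gb = weight_memory_gb(params_b)  # assumes 16-bit weights
    print(f"{name}: {params_b}B parameters, ~{gb:.0f} GB of weights at 16-bit")
```

Under that assumption, Nano needs about 32 GB for weights alone and Super about 128 GB. Passing `bytes_per_param=1` gives the corresponding estimate for 8-bit weights.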