NVIDIA / Cosmos

Introduction

NVIDIA Cosmos is an open platform of world models, datasets, and tools that enables developers to build Physical AI for robots, autonomous vehicles, smart infrastructure, and more.

Cosmos 3

Cosmos 3 is our newest model family: a suite of omnimodal world models designed to jointly process and generate language, images, video, audio, and action sequences within a unified Mixture-of-Transformers (MoT) architecture. By supporting flexible input-output configurations, it unifies the modalities critical to Physical AI, subsuming vision-language models, video generators, world simulators, and world-action models into a single framework.

Key Capabilities

World understanding: Analyze videos and images for captions, temporal events, next actions, spatial grounding, physical plausibility, and causal outcomes.
World generation: Produce images, videos, synchronized sound, and action-conditioned rollouts from text, image, video, or action inputs.
Action modeling: Predict policy actions, inverse dynamics, and forward dynamics for robotics, camera motion, egocentric motion, and autonomous-driving settings.

Model Architecture

Cosmos 3 is an omnimodal world model built on a unified Mixture-of-Transformers (MoT) architecture that combines an autoregressive (AR) transformer for reasoning with a diffusion transformer (DM) for multimodal generation. In Reasoner Mode, language and visual understanding tokens are processed through causal self-attention, enabling next-token prediction for tasks such as perception, planning, and world reasoning. In Generator Mode, noisy image, video, audio, and action tokens are denoised through full attention, allowing the model to jointly generate coherent multimodal outputs.
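
The difference between the two attention patterns named above can be illustrated with a minimal sketch. It uses plain Python lists rather than any Cosmos API, and the function names `causal_mask` and `full_mask` are hypothetical helpers chosen for illustration. It shows only the generic masking idea (which positions may attend to which), not how Cosmos 3 actually lays out or combines its token streams.

```python
def causal_mask(n):
    """Causal self-attention: position i may attend only to positions 0..i.

    This is the pattern that makes next-token prediction possible, since
    no position can see tokens that come after it.
    """
    return [[j <= i for j in range(n)] for i in range(n)]


def full_mask(n):
    """Full attention: every position may attend to every other position.

    This is the pattern typically used when denoising a whole block of
    tokens jointly.
    """
    return [[True] * n for _ in range(n)]


def render(mask):
    """Render a mask as rows of 'x' (may attend) and '.' (blocked)."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)


if __name__ == "__main__":
    print("causal:")
    print(render(causal_mask(4)))
    print("full:")
    print(render(full_mask(4)))
```

In the causal case the rendered mask is lower triangular, while in the full case every entry is set.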

Model Family

Model	Size	Primary Capability
Cosmos3-Nano	16B	Compact omnimodal world model for multimodal understanding, world simulation, future prediction, action reasoning, and Physical AI.
Cosmos3-Super	64B	Frontier-scale omnimodal world model for advanced multimodal understanding, world simulation, future prediction, action reasoning, and Physical AI.
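
As a rough sizing aid, the sketch below estimates how much memory the weights alone would occupy at common numeric precisions. It assumes that "16B" and "64B" denote total parameter counts in billions, which the table does not state explicitly. The helper name `weight_memory_gb` is hypothetical, and the figures exclude activations, caches, and any runtime overhead.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1}

# Assumption: the sizes in the table are total parameter counts.
MODEL_PARAMS = {
    "Cosmos3-Nano": 16_000_000_000,
    "Cosmos3-Super": 64_000_000_000,
}


def weight_memory_gb(num_params, dtype="bf16"):
    """Return the weight footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for name, params in MODEL_PARAMS.items():
        for dtype in BYTES_PER_PARAM:
            print(f"{name} @ {dtype}: {weight_memory_gb(params, dtype):.0f} GB")
```

Under these assumptions, the weights come to about 32 GB for the 16B model and 128 GB for the 64B model in bf16.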

Supported Generation Settings

Resolution tiers: 256p, 480p, 720p (default=480p)
Aspect ratios: 16:9, 4:3, 1:1, 3:4, 9:16 (default=16:9)
Frame rates: 10, 16, 24, and 30 FPS (default=24)
Frame count: 5 to 300 frames (default=189)
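
The settings above can be captured in a small validated configuration object. This is a minimal sketch, not the Cosmos API: the class name `GenerationSettings` and its field names are hypothetical, and the frame-count range is assumed to be inclusive at both ends. It also derives clip duration from frame count and frame rate.

```python
from dataclasses import dataclass

# Allowed values, copied from the list of supported settings.
RESOLUTIONS = ("256p", "480p", "720p")
ASPECT_RATIOS = ("16:9", "4:3", "1:1", "3:4", "9:16")
FRAME_RATES = (10, 16, 24, 30)
MIN_FRAMES, MAX_FRAMES = 5, 300  # assumed inclusive


@dataclass(frozen=True)
class GenerationSettings:
    """Hypothetical container for generation settings, with documented defaults."""

    resolution: str = "480p"
    aspect_ratio: str = "16:9"
    fps: int = 24
    num_frames: int = 189

    def __post_init__(self):
        # Reject any value outside the documented set or range.
        if self.resolution not in RESOLUTIONS:
            raise ValueError(f"unsupported resolution: {self.resolution!r}")
        if self.aspect_ratio not in ASPECT_RATIOS:
            raise ValueError(f"unsupported aspect ratio: {self.aspect_ratio!r}")
        if self.fps not in FRAME_RATES:
            raise ValueError(f"unsupported frame rate: {self.fps!r}")
        if not MIN_FRAMES <= self.num_frames <= MAX_FRAMES:
            raise ValueError(f"frame count out of range: {self.num_frames!r}")

    @property
    def duration_seconds(self):
        """Clip length in seconds: frame count divided by frame rate."""
        return self.num_frames / self.fps


if __name__ == "__main__":
    defaults = GenerationSettings()
    print(defaults, defaults.duration_seconds)
```

With the default values, 189 frames at 24 FPS gives a clip of 7.875 seconds.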
