NVIDIA Releases Cosmos 3: A Two-Tower Mixture-of-Transformers Foundation Model Unifying Physical Reasoning, World Generation, and Action Generation

NVIDIA AI team have released Cosmos 3. It is a family of omnimodal world models for physical AI. The models combine physical reasoning, world generation, and action generation. All three capabilities live inside one open model. NVIDIA open sourced the checkpoints, training scripts, deployment tools, and datasets. The Cosmos 3 release targets robotics, autonomous vehicles, and warehouse monitoring teams.

NVIDIA Cosmos 3

Physical AI systems must understand the world before acting in it. Robots and vehicles need to perceive, predict, and then act. Earlier Cosmos releases split these jobs across separate models. Cosmos 3 unifies them with a Mixture-of-Transformers (MoT) architecture. The architecture is built around two towers.

The reasoner tower is a vision-language model (VLM). It interprets images, videos, and text using an autoregressive architecture. It understands motion, object interactions, and other physical context. NVIDIA team describes this tower as the model’s brain.

The generator tower produces future observations and action sequences. It uses a diffusion-based process for physics-aware video and actions. These outputs are conditioned on the reasoner tower’s understanding. Information flows one way, from reasoner to generator. The reasoner can run alone. The generator always activates both towers for guided generation.

A single model can therefore handle reasoning and generation together.

https://developer.nvidia.com/blog/develop-physical-ai-reasoning-world-and-action-models-with-nvidia-cosmos-3

The Model Family

NVIDIA team describes three model scales: Edge, Nano, and Super. Each uses the dual-tower Mixture-of-Transformers design. The two towers are initialized from pre-trained Qwen3-VL weights. That roughly doubles the parameter count of the backbone transformer.

Cosmos3-Nano is a 16B model built on a dense 8B transformer. It adapts the Qwen3-VL 8B architecture. Nano targets efficient inference on workstation GPUs. It runs on hardware like the NVIDIA RTX PRO 6000. That suits real-time robotics and on-device physical AI.

Cosmos3-Super is a 64B model built on a dense 32B transformer. It adapts the Qwen3-VL 32B architecture. Super targets datacenter GPUs, including NVIDIA Hopper and Blackwell. It fits large-scale synthetic data generation and advanced reasoning.

This release ships Nano and Super, along with task-specific variants. These include Super Text2Image, Super Image2Video, and Nano-Policy-DROID.

How the Unified Design Works

Both towers share one transformer architecture and a joint attention operator. They use a 3D multimodal rotary position embedding (mRoPE). mRoPE aligns video, audio, and action tokens on one temporal axis. In Reasoner Mode, tokens pass through causal self-attention. This enables next-token prediction for perception, planning, and reasoning. In Generator Mode, noisy tokens are denoised through full attention. The autoregressive tokens are never updated by the diffusion tokens.

The model treats action as a core modality with dedicated action tokens. Supported inputs include text, image, video, and JSON action arrays. Outputs include images, video, synchronized sound, action states, and text. The reasoner follows Qwen3-VL-compatible message conventions for vision inputs.

Generation supports 256p, 480p, and 720p resolution tiers. Frame counts range from 5 to 300, defaulting to 189. That equals about 7.9 seconds of video at 24 FPS. Sound is generated as stereo AAC at 48 kHz. Action conditioning spans camera, vehicle, egocentric, single-arm, dual-arm, and humanoid embodiments. Each embodiment uses a fixed action dimension, such as 9D for cameras.

The Benchmark Case

NVIDIA team evaluated Cosmos 3 across reasoning and generation suites. On reasoning, Super and Nano lead VANTAGE-Bench at their respective tiers. VANTAGE-Bench tests VLMs on real-world fixed-camera footage. It covers warehouses, transportation, and smart spaces. Cosmos 3 also tops the Traffic Anomaly Reasoning (TAR) leaderboard. TAR is the official leaderboard for AI City Challenge 2026 Track 3.

On generation, NVIDIA reports open-source state-of-the-art results. Cosmos 3 is the open-source SOTA on R-Bench. It also leads PAI-Bench, Physics-IQ, and RoboLab on public leaderboards. On Artificial Analysis, it leads two open-source leaderboards. These cover text-to-image and image-to-video without audio.

NVIDIA team also introduced its Cosmos Human Evaluation framework, called HUE. HUE decomposes each generated video into yes/no fact questions. It scores four dimensions across seven physical AI domains. The dimensions are semantic alignment, physical laws, geometric reasoning, and visual integrity. A VLM pipeline drafts the questions, and human experts refine them.

Marktechpost’s Visual Explainer

marktechpost@guide ~ /nvidia/cosmos-3
01 / 09

DEVELOPER GUIDE · PHYSICAL AI

NVIDIA Cosmos 3

Open omnimodal world models for physical AI.

Released May 31, 2026. One model for physical reasoning, world generation, and action generation.

Mixture-of-Transformers
Open weights
OpenMDW-1.1

Use ← → or swipe to navigate

01 · WHAT IT IS

A unified model for understanding and generation

Cosmos 3 is a family of omnimodal world models for physical AI. Earlier Cosmos releases split jobs across separate models. Cosmos 3 unifies them in a single open model.

Physical reasoning over images, video, and text.
World generation of physics-aware video and sound.
Action generation for robots and autonomous systems.

Subsumes VLMs, video generators, world simulators, and world-action models.

02 · ARCHITECTURE

Two towers, one transformer

REASONER TOWER

An autoregressive vision-language model (VLM). It interprets motion, object interactions, and physical context. NVIDIA calls it the model’s brain.

GENERATOR TOWER

A diffusion-based path for physics-aware video and actions. It is conditioned on the reasoner’s understanding.

Information flows one way, reasoner → generator. Both towers share a 3D multimodal RoPE (mRoPE).

03 · MODEL FAMILY

Pick a size for your hardware

Cosmos3-Nano
16B total (dense 8B, Qwen3-VL 8B). Workstation GPUs like RTX PRO 6000. Real-time robotics.

Cosmos3-Super
64B total (dense 32B, Qwen3-VL 32B). Datacenter Hopper and Blackwell GPUs. Large-scale SDG.

Cosmos3-Edge
4B total (dense 2B). On-device scale. Planned for a later release.

Plus variants: Super-Text2Image, Super-Image2Video, and Nano-Policy-DROID.

04 · MODALITIES

Inputs, outputs, and generation settings

Inputs: text, image, video, and JSON action arrays.
Outputs: image, video, synchronized sound, action states, text.
Resolution: 256p, 480p, 720p. Sound: stereo AAC at 48 kHz.
Length: 5 to 300 frames; default 189 (about 7.9s at 24 FPS).
Embodiments: camera, vehicle, egocentric, single-arm, dual-arm, humanoid.

05 · BENCHMARKS

What NVIDIA reports

REASONING

Nano and Super lead VANTAGE-Bench at their tiers. Cosmos 3 tops TAR, the AI City Challenge 2026 Track 3 leaderboard.

GENERATION

Open-source SOTA on R-Bench. Leads PAI-Bench, Physics-IQ, and RoboLab. Top open-source on Artificial Analysis text-to-image and image-to-video.

HUE evaluates videos with yes/no fact checks across four dimensions and seven domains.

06 · OPEN RELEASE

Everything ships open

Checkpoints for Nano, Super, and task-specific variants.
Six SDG datasets: robotics, physics, spatial reasoning, human motion, driving, warehouses.
Training recipes: SFT plus action post-training.
Action modes: forward dynamics, inverse dynamics, and policy generation.
License: OpenMDW-1.1.

07 · DEPLOYMENT

Run it in production

NIM microservices: Reasoner NIM available now; Generator NIM later.
Quantization: BF16, FP8, and NVFP4. NVFP4 gives up to 2x speedup.
Serving: the Reasoner NIM stack is built on vLLM.
Efficient Video Sampling (EVS): prunes redundant video tokens at inference.

Use Diffusers and Transformers for research; vLLM-Omni and vLLM for serving.

08 · LIMITATIONS & START

Know the caveats, then build

Outputs can show temporal inconsistency, unstable motion, object morphing, inaccurate 3D structure, and sound-video misalignment. Safety-critical control needs validation, guardrails, and system-level analysis.

GitHubgithub.com/nvidia/cosmos

Hugging Facehuggingface.co/collections/nvidia/cosmos3

Key Takeaways

Cosmos 3 is NVIDIA’s open family of omnimodal world models, unifying physical reasoning, world generation, and action generation in one model.
A two-tower Mixture-of-Transformers design pairs an autoregressive VLM reasoner with a diffusion generator, conditioned one-way from reasoner to generator.
Two checkpoints ship now: Cosmos3-Nano (16B, dense 8B backbone) for workstations and Cosmos3-Super (64B, dense 32B backbone) for datacenters.
NVIDIA open sourced the checkpoints, six SDG datasets, training recipes, and the HUE benchmark under the OpenMDW-1.1 license.
It reports open-source SOTA on R-Bench and leading Artificial Analysis text-to-image and image-to-video results.

Check out the Model Weights, GitHub Repo, Project Page and Technical details. Also, feel free to follow us on Twitter and don’t forget to join our 150k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.

Need to partner with us for promoting your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar etc.? Connect with us

Source link