NVIDIA Introduces SANA-WM: A 2.6B-Parameter Open-Source World Model That Generates Minute-Scale 720p Video on a Single GPU


World models (systems that synthesize realistic video sequences from an initial image and a set of actions) are becoming central to embodied AI, simulation, and robotics research. The core challenge is scaling these systems to generate minute-long, high-resolution video without requiring prohibitively large clusters for both training and inference. Most competitive open-source baselines either require multi-GPU inference or sacrifice resolution to stay within compute budgets.

NVIDIA’s SANA-WM directly targets these bottlenecks. Built on the SANA-Video codebase and available through the NVlabs/Sana GitHub repository, it is a 2.6B-parameter Diffusion Transformer (DiT) trained natively for one-minute generation at 720p with metric-scale 6-DoF camera control. It supports three single-GPU inference variants: a bidirectional generator for high-quality offline synthesis, a chunk-causal autoregressive generator for sequential rollout, and a few-step distilled autoregressive generator for faster deployment. The distilled variant denoises a 60-second 720p clip in 34 seconds on a single RTX 5090 with NVFP4 quantization.

Paper: https://arxiv.org/pdf/2605.15178

The Architecture: Four Core Design Decisions

1. Hybrid Linear Attention with Gated DeltaNet (GDN)

Standard softmax attention has memory and compute complexity that grows quadratically with sequence length — a serious problem when generating 961 latent frames for a 60-second video at 720p. SANA-Video, the predecessor, used cumulative ReLU-based linear attention, which maintains a constant-size recurrent state. However, this has no decay mechanism: all past frames accumulate with equal weight, causing drift over minute-scale sequences.
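To make the scaling concrete, here is a symbolic back-of-the-envelope comparison; the token count per frame below is a placeholder, since the exact spatial resolution of the latent is not stated here:

def attention_cost(latent_frames: int, tokens_per_frame: int) -> dict:
    """Rough operation counts: pairwise softmax attention vs. a recurrent
    linear-attention pass over the same sequence (illustrative only)."""
    n = latent_frames * tokens_per_frame        # total sequence length
    return {
        "softmax_pairs": n * n,                 # grows quadratically in n
        "recurrent_updates": latent_frames,     # one constant-size state update per frame
    }

# 961 latent frames for a 60-second clip; 1024 is a placeholder token count.
print(attention_cost(961, 1024))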

SANA-WM replaces most attention blocks with frame-wise Gated DeltaNet (GDN). Unlike token-wise GDN used in language models, SANA-WM’s frame-wise variant processes one entire latent frame per recurrent step. The GDN update rule incorporates a decay gate γ (which down-weights stale past frames) and a delta-rule correction (which updates only the residual between the target value and the current state prediction), keeping the recurrent state at a constant D×D size regardless of video length.

To stabilize training, the research team introduces an algebraic key-scaling approach: keys are scaled by 1/√(D·S), where D is the head dimension and S is the number of spatial tokens per frame. This keeps the spectral norm of the transition matrix bounded and eliminates the NaN divergences observed with the alternatives: standard L2 key normalization (1/√D) diverged at step 16, and no scaling at all diverged at step 1.
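As a concrete illustration, here is a minimal PyTorch-style sketch of one frame-wise GDN step with the 1/√(D·S) key scaling. It is a sketch only: the real model applies gates per head and runs in fused Triton kernels, and the exact parameterization of γ and β is simplified here.

import math
import torch

def gdn_frame_step(state, k, v, gamma, beta):
    """One frame-wise Gated DeltaNet update (illustrative sketch).

    state: (D, D) recurrent state carried across latent frames
    k, v:  (S, D) keys / values for the S spatial tokens of the current frame
    gamma: decay gate in (0, 1) that down-weights stale past content
    beta:  gate on the delta-rule write strength
    """
    S_tok, D = k.shape
    k = k / math.sqrt(D * S_tok)               # algebraic key scaling: 1/sqrt(D*S)
    state = gamma * state                      # decay: forget old frames
    pred = k @ state                           # (S, D) what the current state predicts
    state = state + beta * (k.T @ (v - pred))  # delta rule: write only the residual
    return state                               # stays D x D regardless of video length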

The final backbone interleaves 15 frame-wise GDN blocks with 5 softmax attention blocks (at layers 3, 7, 11, 15, and 19) across 20 total transformer blocks. The softmax blocks provide exact long-range recall where GDN’s recurrence alone is insufficient.
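In code form, the interleaving pattern is trivial to express (a 0-based index convention is assumed):

# 20 transformer blocks: softmax attention at layers {3, 7, 11, 15, 19}
# (every fourth block), frame-wise GDN everywhere else.
SOFTMAX_LAYERS = {3, 7, 11, 15, 19}
layout = ["softmax" if i in SOFTMAX_LAYERS else "gdn" for i in range(20)]
assert layout.count("gdn") == 15 and layout.count("softmax") == 5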

2. Dual-Branch Camera Control

Camera-controlled world modeling requires the model to faithfully follow a continuous 6-DoF trajectory, not just align with a text description of motion. SANA-WM uses two complementary branches that operate at different temporal rates:

  • Coarse branch (UCPE attention): Operates at the latent-frame rate. For each latent token, it computes a ray-local camera basis from the camera-to-world pose and intrinsics, then applies a Unified Camera Positional Encoding (UCPE) to the geometric channels of each attention head. This captures global trajectory structure across the full sequence.
  • Fine branch (Plücker mixing): Addresses a compression mismatch. Each latent token summarizes eight raw frames, each with its own distinct camera pose. The fine branch computes pixel-wise Plücker raymaps (a 6D representation: ray direction d and moment o×d) from all eight raw frames within one VAE temporal stride, packs them into a 48-channel tensor, and injects this embedding after each self-attention output via a zero-initialized projection. This restores intra-stride camera motion that the coarse branch cannot see at latent-frame resolution. A sketch of the raymap computation follows this list.
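Here is a minimal sketch of a pixel-wise Plücker raymap for a single raw frame under standard pinhole assumptions; stacking eight such 6-channel maps per VAE temporal stride gives the 48-channel tensor described above. The function name and normalization details are illustrative, not the paper's exact implementation.

import torch

def plucker_raymap(K, c2w, H, W):
    """Pixel-wise Plücker raymap (d, o×d) for one raw frame (illustrative).

    K:   (3, 3) camera intrinsics
    c2w: (4, 4) camera-to-world pose
    Returns a (6, H, W) tensor: normalized ray direction d and moment o×d.
    """
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs + 0.5, ys + 0.5, torch.ones_like(xs)], dim=-1)  # (H, W, 3)
    dirs_cam = pix @ torch.linalg.inv(K).T        # back-project pixels to camera rays
    dirs = dirs_cam @ c2w[:3, :3].T               # rotate rays into world frame
    d = dirs / dirs.norm(dim=-1, keepdim=True)    # unit ray directions
    o = c2w[:3, 3].expand_as(d)                   # camera origin, broadcast per pixel
    m = torch.cross(o, d, dim=-1)                 # moment o×d
    return torch.cat([d, m], dim=-1).permute(2, 0, 1)  # (6, H, W)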

Ablations on OmniWorld show that neither branch alone matches the dual approach: UCPE-only achieves a Camera Motion Consistency (CamMC) of 0.2453, while UCPE + Plücker mixing reaches 0.2047.

3. Two-Stage Generation Pipeline

Stage-1 SANA-WM outputs, while spatiotemporally consistent, can contain structural artifacts over long sequences. A second-stage refiner, initialized from the 17B LTX-2 model with rank-384 LoRA adapters fine-tuned on paired synthetic and real video data, corrects these artifacts. It uses truncated-σ flow matching: stage-1 latents are perturbed with a large starting noise (σ_start = 0.9), and the refiner learns to map this noisy input toward the high-fidelity target. Only three Euler denoising steps are needed at inference. The refiner reduces long-horizon visual drift (ΔIQ) from 3.79 to 1.17 on the Simple-Trajectory split, and from 3.09 to 0.31 on the Hard-Trajectory split.
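Conceptually, the refinement step looks like the following sketch. The refiner's velocity-prediction interface and the linear noise schedule are assumptions; only σ_start = 0.9 and the three Euler steps come from the paper.

import torch

def refine_truncated_sigma(refiner, z_stage1, sigma_start=0.9, steps=3):
    """Truncated-σ flow-matching refinement, a minimal sketch.

    Rather than denoising from pure noise (σ = 1), the stage-1 latents are
    perturbed with large starting noise and integrated back toward σ = 0.
    `refiner` is assumed to predict the flow-matching velocity v(z, σ).
    """
    noise = torch.randn_like(z_stage1)
    z = (1.0 - sigma_start) * z_stage1 + sigma_start * noise  # perturbed start point
    sigmas = torch.linspace(sigma_start, 0.0, steps + 1)
    for i in range(steps):                                    # 3 Euler steps at inference
        v = refiner(z, sigmas[i])
        z = z + (sigmas[i + 1] - sigmas[i]) * v               # Euler step toward σ = 0
    return z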

4. Robust Data Annotation Pipeline

Training camera-controlled video generation requires metric-scale 6-DoF pose annotations, information that is not available in standard video datasets. The research team modified VIPE (a camera-pose annotation engine) by replacing its depth backend with Pi3X (for long-sequence-consistent depth) fused with MoGe-2 (for accurate per-frame metric scale). They also extended the bundle-adjustment stage to treat focal lengths and principal points as per-frame variables rather than shared global intrinsics, enabling more robust annotation of internet video with varying focal lengths.

The resulting pipeline processes seven training corpora drawn from open-source collections: SpatialVID-HQ (real, 10s clips), DL3DV real clips (10s), DL3DV GS Refined synthetic clips (60s, rendered via 3D Gaussian Splatting), OmniWorld (synthetic, 60s), Sekai Game (synthetic, 60s), Sekai Walking-HQ (real, 60s), and MiraData (real, 60s). This yields a total of 212,975 clips with metric-scale pose annotations. The LTX2-VAE used for compression is 2.0× smaller than ST-DC-AE and 8.0× smaller than Wan2.1-VAE, which directly improves training and inference efficiency.

For DL3DV, which contains static 3D scene captures rather than native one-minute videos, the research team fit one FCGS 3D Gaussian Splatting reconstruction per scene, designed diverse one-minute camera paths, rendered long videos with known intrinsics and extrinsics, and then refined the rendered outputs with DiFix3D to reduce splatting artifacts.

Training Strategy and Infrastructure

SANA-WM is trained in two phases on 64 H100 GPUs. First, before DiT training, the team adapts the LTX2 VAE to the SANA-Video SFT training data over approximately 50K steps, taking roughly 3.5 days. The main DiT training then follows a four-stage progressive schedule lasting approximately 15 days:

  • Stage 1 (~2.75 days): Adapt the pre-trained SANA-Video model to the frame-wise GDN architecture on short (5s) video clips. This replaces cumulative linear attention with the recurrent GDN blocks in a cheaper, short-horizon training regime where failure modes are easier to diagnose.
  • Stage 2 (~2 days): Introduce hybrid attention by replacing every fourth GDN block with a standard softmax attention block on the same short-clip setting, improving the efficiency–quality trade-off.
  • Stage 3 (~8 days): Extend training to 961-frame (60-second) sequences and incorporate Dual-Branch Camera Control. Context-Parallel (CP=2) sharding distributes the latent sequence across GPUs using prefix-sum composition of GDN transition matrices, a mathematically exact parallelization strategy with minimal communication overhead (a toy sketch of this composition follows the list).
  • Stage 4 (~2.5 days): Fine-tune a chunk-causal variant for autoregressive rollout, then apply self-forcing distillation to reduce sampling to four denoising steps. Attention-sink tokens and local temporal windows are added to the softmax attention layers to keep memory and per-chunk latency constant during long rollouts.
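The prefix-sum trick relies on the fact that each GDN update is an affine map on the state, S_t = A_t · S_{t−1} + B_t, and affine maps compose associatively. Below is a scalar toy version of that composition; the real transitions are matrix-valued and run inside fused kernels.

import torch

def compose(f, g):
    """Compose two affine state updates S -> a*S + B: apply f first, then g."""
    a1, B1 = f
    a2, B2 = g
    return (a2 * a1, a2 * B1 + B2)

def prefix_states(ops, S0):
    """Sequential reference for the scan. Because `compose` is associative,
    each context-parallel rank can pre-reduce its local chunk of ops to one
    (a, B) pair and exchange only that pair with its neighbor, which is why
    communication stays minimal."""
    states, acc = [], (torch.tensor(1.0), torch.tensor(0.0))  # identity map
    for op in ops:
        acc = compose(acc, op)
        states.append(acc[0] * S0 + acc[1])
    return states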

Custom fused Triton kernels for GDN scan and gate operations contribute approximately 1.5× to 2× efficiency gains throughout training.

Benchmark Results

The research team introduces a purpose-built 60-second world-model benchmark with 80 initial scenes generated by Nano Banana Pro across four scene categories (game, indoor, outdoor-city, and outdoor-nature; 20 per category), each paired with Simple and Hard camera trajectory splits. The main evaluation uses each model's multi-step, undistilled autoregressive setting.
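Enumerated as a bookkeeping sketch (scene identifiers here are illustrative), the evaluation grid works out to 160 rollouts per model:

# 80 Nano Banana Pro scenes (4 categories × 20), each evaluated
# under both trajectory splits -> 160 rollouts per model.
CATEGORIES = ["game", "indoor", "outdoor-city", "outdoor-nature"]
SPLITS = ["Simple", "Hard"]
runs = [(cat, idx, split) for cat in CATEGORIES for idx in range(20) for split in SPLITS]
assert len(runs) == 160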


On this benchmark, SANA-WM with the second-stage refiner achieves the following across both splits:

  • Camera accuracy (Simple / Hard): Rotation error (RotErr) of 4.50° / 8.34°; Translation error (TransErr) of 1.39 / 1.39; CamMC of 1.41 / 1.44 — the best among all compared methods, including LingBot-World (14B+14B parameters, 8 GPUs) and HY-WorldPlay (8B parameters, 8 GPUs).
  • Visual quality: 80.62 / 81.89 VBench Overall on Simple / Hard splits, comparable to LingBot-World (81.82 / 81.89) while generating 720p outputs on a single GPU per clip.
  • Throughput: 22.0 videos/hour on 8 H100s with the full pipeline (refiner included), compared to 0.6 videos/hour for LingBot-World — a 36× throughput advantage.
  • Memory: The full pipeline fits in 74.7 GB, within the 80 GB H100 budget. Stage-1-only inference fits in 51.1 GB.
  • Temporal stability: After refinement, ΔIQ (imaging quality degradation from first to last 10-second window) drops to 1.17 on Simple and 0.31 on Hard, compared to 23.59 and 25.88 for HY-WorldPlay.

Marktechpost’s Visual Explainer

NVIDIA — SANA-WM Guide


01 / 09  •  Overview

What Is SANA-WM?

SANA-WM is an open-source world model from NVIDIA that takes a single image and a camera trajectory as input, then synthesizes a realistic 60-second, 720p video that faithfully follows that trajectory. Think of it as: one image — infinite explorable worlds.

Most world models either require large multi-GPU inference clusters or sacrifice resolution to stay within budget. SANA-WM makes minute-scale, 720p, camera-controlled generation practical — training on 64 H100 GPUs and running inference on a single GPU.

  • 2.6B parameters (open-source)
  • 720p native output resolution
  • 60s native generation length

Key insight: SANA-WM treats efficiency as a first-class objective — not an afterthought. Its distilled variant denoises a full 60-second 720p clip in 34 seconds on a single RTX 5090 with NVFP4 quantization.

02 / 09  •  The Problem

Why Existing World Models Fall Short

Generating a 60-second video at 720p means modeling 961 latent frames. Standard softmax attention — the default in most video diffusion models — has memory and compute that grows quadratically with sequence length. At minute scale, this runs out of memory on any single GPU.

| Model | Params | Res | GPUs | Throughput |
| --- | --- | --- | --- | --- |
| LingBot-World | 14B+14B | 480p | 8 | 0.6 vids/hr |
| HY-WorldPlay | 8B | 480p | 8 | 1.1 vids/hr |
| Matrix-Game 3.0 | 5B | 720p | 8 | 3.1 vids/hr |
| SANA-WM | 2.6B | 720p | 1 | 24.1 vids/hr |

SANA-WM solves this with four architectural designs working together: hybrid linear attention, dual-branch camera control, a two-stage refinement pipeline, and a robust data annotation pipeline.

03 / 09  •  Architecture

Design 1: Hybrid Linear Attention with Gated DeltaNet (GDN)

Standard softmax attention grows quadratically with context length. SANA-Video (the predecessor) used cumulative ReLU-based linear attention — constant memory, but no decay mechanism: all past frames accumulate with equal weight, causing drift at minute scale.

SANA-WM introduces frame-wise Gated DeltaNet (GDN). Unlike token-wise GDN (used in LLMs), each recurrent step processes an entire latent frame. It adds two corrections to the recurrent state:

  • Decay gate (γ): forgets stale past-frame content by multiplying the previous state by a learned decay scalar.
  • Delta-rule correction (β): updates only the residual between the target value and the current state prediction, not the full state.

The state stays D×D regardless of video length. To prevent gradient instability, keys are scaled by 1/√(D·S), where D is head dimension and S is spatial tokens per frame. Without this scaling, training diverges to NaN (at step 16 with standard 1/√D normalization, at step 1 with no scaling).

Final backbone: 20 transformer blocks total — 15 frame-wise GDN blocks + 5 softmax attention blocks at layers {3, 7, 11, 15, 19}. The softmax blocks anchor long-range spatial consistency where GDN alone is insufficient.

04 / 09  •  Architecture

Design 2: Dual-Branch Camera Control

Camera-controlled world modeling requires faithful adherence to a continuous 6-DoF trajectory — not just text-described motion. SANA-WM uses two complementary branches operating at different temporal rates:

🌎 Coarse Branch — UCPE

Operates at latent-frame rate. Computes a ray-local camera basis from the camera-to-world pose and intrinsics. Applies Unified Camera Positional Encoding (UCPE) to the geometric channels of each attention head. Captures global 6-DoF trajectory structure across the full sequence.

📷 Fine Branch — Plücker Mixing

Addresses a compression mismatch: each latent token summarizes 8 raw frames, each with a distinct camera pose. Computes pixel-wise Plücker raymaps (a 6D representation: ray direction d and moment o×d) from all 8 raw frames per VAE temporal stride, packs them into a 48-channel tensor, and injects this after each self-attention output via a zero-initialized projection.

| Camera Encoding | RotErr ↓ | TransErr ↓ | CamMC ↓ |
| --- | --- | --- | --- |
| No control | 16.93 | 0.2347 | 0.4937 |
| Plücker only | 16.02 | 0.2340 | 0.4742 |
| UCPE only | 7.73 | 0.1350 | 0.2453 |
| UCPE + Plücker | 6.21 | 0.1162 | 0.2047 |

05 / 09  •  Architecture

Design 3: Two-Stage Generation Pipeline

Stage-1 SANA-WM outputs are spatiotemporally consistent, but can contain structural artifacts over long sequences. A dedicated second-stage refiner corrects these.

  1. Initialization: The refiner starts from the 17B LTX-2 model with rank-384 LoRA adapters applied to the attention (Q/K/V/O) and feed-forward projections. LoRA-only fine-tuning keeps it lightweight compared with full 17B optimization.

  2. Truncated-σ flow matching: Stage-1 latents are perturbed with large starting noise (σ_start = 0.9). The refiner learns to map this noisy input toward the high-fidelity target: refinement rather than full reconstruction.

  3. Inference: Only 3 Euler denoising steps are needed. The LoRA adapters are merged into the distilled LTX-2 base, with minimal impact on end-to-end throughput.

  • ΔIQ after refiner: 1.17 on the Simple split (vs. 3.79 before) and 0.31 on the Hard split (vs. 3.09 before).
  • Throughput: 22.0 videos/hr on 8 H100s with the full pipeline.

ΔIQ = imaging-quality score in the first 10-second window minus the last 10-second window. Lower = less degradation over the minute.
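In code, the metric is a one-liner (a trivial sketch):

def delta_iq(windowed_iq):
    """ΔIQ: imaging-quality score of the first 10s window minus the last.
    Lower means less visual drift across the minute."""
    return windowed_iq[0] - windowed_iq[-1]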

06 / 09  •  Architecture

Design 4: Robust Data Annotation Pipeline

Training camera-controlled generation requires metric-scale 6-DoF pose annotations — information not available in standard video datasets. The team modified the VIPE pose annotation engine:

Depth backend upgrade

Replaced single-frame Metric3D-Small with Pi3X (long-sequence-consistent 3D structure) fused with MoGe-2 (accurate per-frame metric scale). Fused by solving for a per-frame scale factor minimizing weighted depth error, smoothed via exponential moving average (momentum 0.99).
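A minimal sketch of that fusion, treating the per-frame solve as weighted least squares over a single scale factor; the exact weighting and robustness terms are not specified here, and the function name is hypothetical.

import torch

def fuse_metric_depth(pi3x_depth, moge2_depth, weights, prev_scale=None, momentum=0.99):
    """Fuse Pi3X structure with MoGe-2 metric scale, an illustrative sketch.

    Solves min_s sum(w * (s * pi3x - moge2)^2) in closed form for one scalar
    s per frame, then smooths s across frames with an EMA (momentum 0.99).
    """
    s = (weights * pi3x_depth * moge2_depth).sum() / (weights * pi3x_depth**2).sum()
    if prev_scale is not None:
        s = momentum * prev_scale + (1.0 - momentum) * s  # temporal smoothing
    return s * pi3x_depth, s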

Per-frame intrinsics

Extended bundle adjustment to treat focal lengths and principal points as per-frame variables rather than shared global intrinsics — enabling robust annotation on internet video with varying focal lengths.

| Source | Type | Duration | Clips |
| --- | --- | --- | --- |
| SpatialVID-HQ | Real | 10s | 158,369 |
| DL3DV (real) | Real | 10s | 5,691 |
| DL3DV (GS Refined) | Synthetic | 60s | 14,881 |
| OmniWorld | Synthetic | 60s | 1,720 |
| Sekai Game | Synthetic | 60s | 3,560 |
| Sekai Walking-HQ | Real | 60s | 9,767 |
| MiraData | Real | 60s | 18,987 |
| Total | | | 212,975 |

07 / 09  •  Training

Progressive Training Pipeline

Training has two phases on 64 H100 GPUs. First, a VAE pre-adaptation step (~3.5 days, 50K steps) adapts the LTX2 VAE to the SANA-Video SFT data. Then the main DiT training proceeds in four progressive stages (~15 days), summarized in code after the list:

  1. Frame-wise GDN (~2.75 days): Adapt SANA-Video to the GDN recurrent architecture on short 5s clips. The LTX2-VAE is 2.0× smaller than ST-DC-AE and 8.0× smaller than Wan2.1-VAE, cutting token count before any attention is computed.

  2. Hybrid Attention (~2 days): Replace every 4th GDN block with softmax attention on the same 5s short-clip setting to improve the efficiency–quality trade-off before scaling up.

  3. Minute-Scale + CamCtrl (~8 days): Extend to 961-frame (60s) sequences with Dual-Branch Camera Control. Context-Parallel (CP=2) sharding uses prefix-sum composition of GDN transition matrices: mathematically exact, with minimal communication overhead.

  4. SFT + Distillation (~2.5 days): Fine-tune a chunk-causal autoregressive variant on ~50K high-quality clips. Apply self-forcing distillation to reduce sampling to 4 denoising steps. Add attention-sink tokens and local temporal windows to keep softmax memory constant during long rollouts.
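As noted above, the stage timings collapse into a small summary structure; this is bookkeeping only, not an actual training configuration, and the field names are invented:

SCHEDULE = [
    {"stage": 1, "days": 2.75, "clips": "5s",  "focus": "frame-wise GDN adaptation"},
    {"stage": 2, "days": 2.0,  "clips": "5s",  "focus": "hybrid GDN + softmax attention"},
    {"stage": 3, "days": 8.0,  "clips": "60s", "focus": "961 frames + camera control (CP=2)"},
    {"stage": 4, "days": 2.5,  "clips": "60s", "focus": "chunk-causal SFT + 4-step distillation"},
]
# ~15 days of DiT training, after ~3.5 days of LTX2-VAE adaptation.
assert abs(sum(s["days"] for s in SCHEDULE) - 15.25) < 1e-9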

Efficiency: Custom fused Triton kernels for GDN scan and gate operations contribute ~1.5× to 2× throughput gains throughout all stages.

08 / 09  •  Results

Benchmark Results on the 60-Second World-Model Benchmark

Evaluated on 80 scenes (game, indoor, outdoor-city, outdoor-nature) across Simple and Hard camera trajectory splits. Main table uses the multi-step, undistilled autoregressive setting.

| Method | Res | GPUs | RotErr ↓ | TransErr ↓ | CamMC ↓ | VBench ↑ | Tput ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LingBot-World | 480p | 8 | 10.47 / 18.99 | 2.01 / 1.65 | 2.05 / 1.81 | 81.82 / 81.89 | 0.6 |
| HY-WorldPlay | 480p | 8 | 17.89 / 35.46 | 2.36 / 2.34 | 2.45 / 2.64 | 68.82 / 70.46 | 1.1 |
| Matrix-Game 3.0 | 720p | 8 | 12.96 / 18.79 | 1.83 / 1.67 | 1.92 / 1.82 | 78.53 / 78.79 | 3.1 |
| SANA-WM + refiner | 720p | 1 | 4.50 / 8.34 | 1.39 / 1.39 | 1.41 / 1.44 | 80.62 / 81.89 | 22.0 |

Values shown as Simple/Hard split. RotErr in degrees. Tput = videos/hour on 8 H100s. Full pipeline memory: 74.7 GB — within the 80 GB H100 budget.

Highlights: best camera accuracy • 36× higher throughput vs. LingBot-World • 720p on 1 GPU • comparable visual quality.

09 / 09  •  Access

How to Access SANA-WM

SANA-WM is open-source and available through the NVlabs/Sana GitHub repository (Apache 2.0 license for code; individual dataset and weight licenses vary — see Table 11 of the paper). The repo also hosts SANA, SANA-1.5, SANA-Sprint, and SANA-Video.

# Clone the repo
git clone https://github.com/NVlabs/Sana.git
cd Sana && ./environment_setup.sh sana

Three Inference Variants

  • Bidirectional: high-quality offline synthesis (best quality, 49.2 GB)
  • Chunk-causal AR: sequential rollout for streaming (51.1 GB)
  • Distilled AR + NVFP4: 34s per 60s clip on an RTX 5090

Resources

📄 Paper: arXiv:2605.15178

🌎 Project page: nvlabs.github.io/Sana/WM/

📊 GitHub: github.com/NVlabs/Sana

🤔 Limitations: no explicit 3D scene memory; can drift in dynamic scenes or rare viewpoints

Practical workflow suggested by the authors: Search trajectories efficiently with the stage-1 model, then selectively refine promising rollouts with the second-stage refiner for higher fidelity.

Key Takeaways

  • NVIDIA’s SANA-WM generates 60-second, 720p, camera-controlled videos on a single GPU — trained in ~18.5 days on 64 H100s with only 212,975 public video clips.
  • A hybrid Gated DeltaNet + softmax attention backbone keeps the recurrent state at a constant D×D size regardless of video length, solving the memory explosion that makes minute-scale generation impractical with standard softmax attention.
  • Dual-branch camera control (UCPE at the latent-frame rate plus Plücker mixing at the raw-frame rate) cuts CamMC from 0.2453 to 0.2047 in the OmniWorld ablation, and on the 60-second benchmark SANA-WM posts the best camera accuracy among all compared methods, including models over 5× larger.
  • A second-stage refiner initialized from 17B LTX-2 with rank-384 LoRA cuts long-horizon visual drift (ΔIQ) from 3.09 to 0.31 on Hard trajectories using just 3 Euler denoising steps.
  • At 22.0 videos/hour on 8 H100s, SANA-WM + refiner delivers 36× higher throughput than LingBot-World (14B+14B, 8 GPUs) at comparable VBench visual quality scores.

