NVIDIA has released SANA-WM, a 2.6-billion-parameter open-source world model that generates 720p, one-minute videos with six-degrees-of-freedom (6-DoF) camera control on a single GPU. The model achieves visual quality comparable to large-scale industrial baselines such as LingBot-World and HY-WorldPlay while being considerably more efficient: training completes in 15 days on 64 H100 GPUs and uses only 213K public video clips.
Hybrid Linear Attention Enables Efficient Long-Context Video Generation
SANA-WM's technical architecture features four core innovations. The Hybrid Linear Attention mechanism combines frame-wise Gated DeltaNet (GDN) with softmax attention for memory-efficient long-context modeling. A Dual-Branch Camera Control system ensures precise 6-DoF trajectory adherence throughout video sequences. The Two-Stage Generation Pipeline applies a long-video refiner to stage-1 outputs, improving quality and consistency across sequences. Finally, a Robust Annotation Pipeline extracts accurate metric-scale 6-DoF camera poses from public videos to yield high-quality, spatiotemporally consistent action labels.
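To make the memory argument behind linear attention concrete, here is a minimal NumPy sketch of a generic gated delta rule recurrence, the update that Gated DeltaNet is built on. It is an illustration only, not SANA-WM's implementation: the function names, the scalar gates `alpha` and `beta`, and the toy dimensions are all assumptions, and the frame-wise grouping and the interleaved softmax attention layers described above are not modeled.

```python
import numpy as np

def gated_delta_step(S, q, k, v, alpha, beta):
    """One step of a gated delta rule (illustrative, not SANA-WM's code).

    S     : (d_v, d_k) fixed-size memory matrix
    q, k  : (d_k,) query and key; k is assumed to be unit-norm
    v     : (d_v,) value
    alpha : scalar in (0, 1], decay gate applied to the old memory
    beta  : scalar in (0, 1], write strength
    """
    # S_t = alpha * S_{t-1} (I - beta k k^T) + beta v k^T
    S = alpha * (S - beta * np.outer(S @ k, k)) + beta * np.outer(v, k)
    return S, S @ q

def run_sequence(Q, K, V, alphas, betas):
    """Process a whole sequence while carrying only the (d_v, d_k) state."""
    S = np.zeros((V.shape[1], K.shape[1]))
    outputs = []
    for q, k, v, a, b in zip(Q, K, V, alphas, betas):
        S, o = gated_delta_step(S, q, k, v, a, b)
        outputs.append(o)
    return np.stack(outputs), S

# Toy dimensions chosen for illustration only.
rng = np.random.default_rng(0)
T, d_k, d_v = 1000, 8, 4
K = rng.normal(size=(T, d_k))
K /= np.linalg.norm(K, axis=1, keepdims=True)  # unit-norm keys
Q = rng.normal(size=(T, d_k))
V = rng.normal(size=(T, d_v))
out, S_final = run_sequence(Q, K, V, np.full(T, 0.99), np.full(T, 0.5))
print(out.shape, S_final.shape)  # (1000, 4) (4, 8)
```

The state `S` keeps the same shape no matter how many tokens are processed, whereas a softmax attention layer must cache keys and values for every past token. That constant-size state is the property that makes this family of layers attractive for minute-long video sequences.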
Consumer Hardware Deployment on Single RTX 5090
The model's efficiency puts it within reach of research and development teams. SANA-WM generates each 60-second clip on a single GPU, and its distilled variant can be deployed on a single RTX 5090 using NVFP4 quantization. In that configuration, it denoises a 60-second 720p clip in 34 seconds, bringing world modeling to researchers without access to data center infrastructure.
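As a quick sanity check on these figures, the short calculation below converts the reported denoising time into a real-time factor. It uses only the two numbers quoted above; the variable names are illustrative, and the result covers denoising only, since the article does not report the time spent on other steps such as decoding.

```python
clip_seconds = 60.0     # length of the generated clip, from the article
denoise_seconds = 34.0  # reported denoising time on a single RTX 5090

realtime_factor = clip_seconds / denoise_seconds
seconds_per_video_second = denoise_seconds / clip_seconds

print(f"{realtime_factor:.2f}x real time")  # 1.76x real time
print(f"{seconds_per_video_second:.2f} s of denoising per second of video")  # 0.57 s ...
```

In other words, the distilled variant denoises video faster than it plays back, at least for the denoising stage.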
New Baseline for World Modeling and Embodied AI
SANA-WM sets a new baseline for world modeling and embodied AI. Earlier approaches required massive computational resources; because the distilled SANA-WM runs on consumer hardware, world modeling research becomes accessible to far more groups. Its combination of precise camera control, long-sequence generation, and efficient training makes it suitable for robotics simulation, autonomous driving research, and virtual environment generation.
The project is documented on the official NVIDIA Labs project page, and the full technical paper is available on arXiv. The release drew attention from the AI research community, where researchers cited the efficiency gains and open-source availability as its main advantages for advancing world modeling research.
Key Takeaways
- SANA-WM is a 2.6B-parameter world model that generates 720p, 1-minute videos with 6-DoF camera control on a single GPU
- Training completes in 15 days on 64 H100 GPUs using only 213K public video clips with metric-scale pose supervision
- The distilled variant runs on a single RTX 5090, denoising 60-second 720p clips in 34 seconds using NVFP4 quantization
- Hybrid Linear Attention combines Gated DeltaNet with softmax attention for memory-efficient long-context modeling
- The model achieves visual quality comparable to industrial baselines like LingBot-World while significantly improving efficiency
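
For readers who want the training budget from the takeaways in a more comparable unit, the snippet below converts it to GPU-days and GPU-hours. It assumes all 64 GPUs run continuously for the full 15 days, which the article does not state explicitly.

```python
training_days = 15  # from the article
num_gpus = 64       # H100 GPUs, from the article

gpu_days = training_days * num_gpus
gpu_hours = gpu_days * 24

print(gpu_days)   # 960
print(gpu_hours)  # 23040
```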