Researchers have introduced Omni123, a 3D-native foundation model that unifies text-to-2D and text-to-3D generation within a single autoregressive framework. Published on arXiv on April 2, 2026, by Chongjie Ye, Cheng Cao, Chuanyu Pan, Yiming Hao, Yihao Zhi, Yuanming Hu, and Xiaoguang Han, the work addresses a fundamental bottleneck: while multimodal large language models excel at text and image tasks, extending native generation capability to 3D remains challenging because high-quality 3D training data is scarce.
Cross-Modal Consistency Serves as Implicit Geometric Constraint
Omni123's key insight is using cross-modal consistency between images and 3D as an implicit structural constraint. The model represents text, images, and 3D as discrete tokens in a shared sequence space, allowing it to leverage abundant 2D data as a geometric prior for improving 3D representations. This approach bypasses the data scarcity problem that has limited previous 3D generation systems.
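The idea of placing text, images, and 3D in one shared token sequence can be sketched as follows. The vocabulary sizes, modality offsets, and special separator tokens below are illustrative assumptions, not the paper's actual configuration; the point is that disjoint id ranges in one vocabulary let a single autoregressive model consume and emit all three modalities.

```python
# Sketch: text, image, and 3D tokens packed into one shared sequence
# space. All sizes and special tokens are hypothetical placeholders.

TEXT_VOCAB = 1000   # hypothetical text token ids: 0..999
IMG_VOCAB = 8192    # hypothetical VQ image codebook ids
SHAPE_VOCAB = 8192  # hypothetical 3D latent codebook ids

# Offsets place each modality in a disjoint range of one vocabulary.
IMG_OFFSET = TEXT_VOCAB
SHAPE_OFFSET = TEXT_VOCAB + IMG_VOCAB
BOI = SHAPE_OFFSET + SHAPE_VOCAB       # "begin image" separator
BOS3D = BOI + 1                        # "begin 3D" separator

def pack_sequence(text_ids, image_ids, shape_ids):
    """Concatenate modality tokens into one autoregressive sequence."""
    seq = list(text_ids)
    seq.append(BOI)
    seq.extend(t + IMG_OFFSET for t in image_ids)
    seq.append(BOS3D)
    seq.extend(t + SHAPE_OFFSET for t in shape_ids)
    return seq

def modality_of(token):
    """Recover which modality a shared-vocabulary token belongs to."""
    if token in (BOI, BOS3D):
        return "special"
    if token >= SHAPE_OFFSET:
        return "3d"
    if token >= IMG_OFFSET:
        return "image"
    return "text"
```

Because every modality lives in the same sequence, a single next-token objective trains all of them at once, which is what lets abundant 2D data act as a prior for the 3D ranges of the vocabulary.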
Existing methods rely on indirect pipelines that edit in 2D and lift results into 3D through optimization, sacrificing geometric consistency. Omni123 instead trains natively on 3D representations while using 2D images to provide supervision.
Interleaved X-to-X Training Coordinates Cross-Modal Tasks
The paper introduces an interleaved X-to-X training paradigm that coordinates diverse cross-modal tasks over heterogeneous paired datasets without requiring fully aligned text-image-3D triplets. The model enforces consistency by traversing semantic-visual-geometric cycles within autoregressive sequences:
- Text → image → 3D → image cycles enforce semantic alignment
- Multi-view geometric consistency is maintained across modalities
- Appearance fidelity is jointly optimized with geometric structure
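The cycle structure above can be sketched in miniature. The three "models" below are toy stand-in functions (hypothetical, not the paper's networks); what matters is the shape of the computation: an image is generated from text, lifted to 3D, rendered back to an image, and the disagreement between the two images acts as the implicit consistency signal, with no aligned text-image-3D triplet required.

```python
# Minimal sketch of a semantic-visual-geometric consistency cycle.
# All three stages are toy stand-ins for illustration only.

def text_to_image(text_tokens):
    # stand-in for autoregressive text-to-image generation
    return [(t * 7 + 3) % 256 for t in text_tokens]

def image_to_3d(image_tokens):
    # stand-in for lifting image tokens to 3D tokens (invertible toy map)
    return [t + 100 for t in image_tokens]

def render_3d_to_image(shape_tokens, noise=0):
    # stand-in for rendering 3D tokens back to image tokens;
    # `noise` simulates geometric drift in the round trip
    return [t - 100 + noise for t in shape_tokens]

def cycle_consistency_loss(text_tokens, noise=0):
    """Fraction of positions where the round-tripped image disagrees
    with the directly generated one: the implicit geometric constraint."""
    img = text_to_image(text_tokens)
    img_roundtrip = render_3d_to_image(image_to_3d(img), noise=noise)
    mismatches = sum(a != b for a, b in zip(img, img_roundtrip))
    return mismatches / len(img)
```

With a perfectly consistent pipeline the loss is zero; any drift introduced by the 3D round trip shows up as a nonzero loss, which is the kind of signal the interleaved X-to-X training exploits.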
Experiments demonstrate that Omni123 significantly improves text-guided 3D generation and editing compared to existing approaches. The framework provides a scalable path toward multimodal 3D world models, with implications for gaming, AR/VR, and digital twin applications where 3D content creation remains a major bottleneck.
Key Takeaways
- Omni123 unifies text-to-2D and text-to-3D generation in a single autoregressive framework, treating all modalities as discrete token sequences
- The model uses cross-modal consistency between images and 3D as an implicit geometric constraint, leveraging abundant 2D data to improve 3D representations
- An interleaved X-to-X training paradigm coordinates cross-modal tasks without requiring fully aligned text-image-3D triplets
- The approach bypasses 3D data scarcity by using 2D images as geometric priors during training
- Omni123 provides a scalable path toward multimodal 3D world models with applications in gaming, AR/VR, and digital twins