Researchers have introduced Omni123, a 3D-native foundation model that unifies text-to-2D and text-to-3D generation within a single autoregressive framework. Published on arXiv on April 2, 2026, by Chongjie Ye, Cheng Cao, Chuanyu Pan, Yiming Hao, Yihao Zhi, Yuanming Hu, and Xiaoguang Han, the work addresses a fundamental bottleneck: while multimodal large language models excel at text and image tasks, extending native generation capability to 3D remains challenging because high-quality 3D training data is scarce.
Cross-Modal Consistency Serves as Implicit Geometric Constraint
Omni123's key insight is using cross-modal consistency between images and 3D as an implicit structural constraint. The model represents text, images, and 3D as discrete tokens in a shared sequence space, allowing it to leverage abundant 2D data as a geometric prior for improving 3D representations. This approach bypasses the data scarcity problem that has limited previous 3D generation systems.
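The idea of placing text, images, and 3D in one shared token sequence can be sketched as follows. The vocabulary sizes, modality offsets, and special separator tokens below are illustrative assumptions, not the paper's actual configuration; the point is that disjoint id ranges in one vocabulary let a single autoregressive model consume and emit all three modalities.

```python
# Sketch: text, image, and 3D tokens packed into one shared sequence
# space. All sizes and special tokens are hypothetical placeholders.

TEXT_VOCAB = 1000   # hypothetical text token ids: 0..999
IMG_VOCAB = 8192    # hypothetical VQ image codebook ids
SHAPE_VOCAB = 8192  # hypothetical 3D latent codebook ids

# Offsets place each modality in a disjoint range of one vocabulary.
IMG_OFFSET = TEXT_VOCAB
SHAPE_OFFSET = TEXT_VOCAB + IMG_VOCAB
BOI = SHAPE_OFFSET + SHAPE_VOCAB       # "begin image" separator
BOS3D = BOI + 1                        # "begin 3D" separator

def pack_sequence(text_ids, image_ids, shape_ids):
    """Concatenate modality tokens into one autoregressive sequence."""
    seq = list(text_ids)
    seq.append(BOI)
    seq.extend(t + IMG_OFFSET for t in image_ids)
    seq.append(BOS3D)
    seq.extend(t + SHAPE_OFFSET for t in shape_ids)
    return seq

def modality_of(token):
    """Recover which modality a shared-vocabulary token belongs to."""
    if token in (BOI, BOS3D):
        return "special"
    if token >= SHAPE_OFFSET:
        return "3d"
    if token >= IMG_OFFSET:
        return "image"
    return "text"
```

Because every modality lives in the same sequence, a single next-token objective trains all of them at once, which is what lets abundant 2D data act as a prior for the 3D ranges of the vocabulary.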
Existing methods rely on indirect pipelines that edit in 2D and lift results into 3D through optimization, sacrificing geometric consistency. Omni123 instead trains natively on 3D representations while using 2D images to provide supervision.
Interleaved X-to-X Training Coordinates Cross-Modal Tasks
The paper introduces an interleaved X-to-X training paradigm that coordinates diverse cross-modal tasks over heterogeneous paired datasets without requiring fully aligned text-image-3D triplets. The model enforces consistency by traversing semantic-visual-geometric cycles within autoregressive sequences:
- Text → image → 3D → image cycles enforce semantic alignment
- Multi-view geometric consistency is maintained across modalities
- Appearance fidelity is jointly optimized with geometric structure
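The cycle structure above can be sketched in miniature. The three "models" below are toy stand-in functions (hypothetical, not the paper's networks); what matters is the shape of the computation: an image is generated from text, lifted to 3D, rendered back to an image, and the disagreement between the two images acts as the implicit consistency signal, with no aligned text-image-3D triplet required.

```python
# Minimal sketch of a semantic-visual-geometric consistency cycle.
# All three stages are toy stand-ins for illustration only.

def text_to_image(text_tokens):
    # stand-in for autoregressive text-to-image generation
    return [(t * 7 + 3) % 256 for t in text_tokens]

def image_to_3d(image_tokens):
    # stand-in for lifting image tokens to 3D tokens (invertible toy map)
    return [t + 100 for t in image_tokens]

def render_3d_to_image(shape_tokens, noise=0):
    # stand-in for rendering 3D tokens back to image tokens;
    # `noise` simulates geometric drift in the round trip
    return [t - 100 + noise for t in shape_tokens]

def cycle_consistency_loss(text_tokens, noise=0):
    """Fraction of positions where the round-tripped image disagrees
    with the directly generated one: the implicit geometric constraint."""
    img = text_to_image(text_tokens)
    img_roundtrip = render_3d_to_image(image_to_3d(img), noise=noise)
    mismatches = sum(a != b for a, b in zip(img, img_roundtrip))
    return mismatches / len(img)
```

With a perfectly consistent pipeline the loss is zero; any drift introduced by the 3D round trip shows up as a nonzero loss, which is the kind of signal the interleaved X-to-X training exploits.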
Experiments demonstrate that Omni123 significantly improves text-guided 3D generation and editing compared to existing approaches. The framework provides a scalable path toward multimodal 3D world models, with implications for gaming, AR/VR, and digital twin applications where 3D content creation remains a major bottleneck.
Key Takeaways
- Omni123 unifies text-to-2D and text-to-3D generation in a single autoregressive framework, treating all modalities as discrete token sequences
- The model uses cross-modal consistency between images and 3D as an implicit geometric constraint, leveraging abundant 2D data to improve 3D representations
- An interleaved X-to-X training paradigm coordinates cross-modal tasks without requiring fully aligned text-image-3D triplets
- The approach bypasses 3D data scarcity by using 2D images as geometric priors during training
- Omni123 provides a scalable path toward multimodal 3D world models with applications in gaming, AR/VR, and digital twins