Researchers from Princeton and Meta have released Vero, a family of fully open vision-language models that match or exceed existing open-weight models across diverse visual reasoning tasks. Published on arXiv on April 6, 2026, the research addresses a critical gap in AI transparency by releasing a complete reinforcement learning pipeline of the kind that proprietary labs have kept locked away.
The release includes all training data, code, and models, enabling reproducible research on visual reasoning systems. The 600,000-sample dataset, constructed from 59 different datasets, reflects substantial work in data curation and in reward design across heterogeneous tasks.
Vero Improves Base Models by 3.7-5.5 Points on Average
The research team, led by Gabriel Sarch, Linrong Cai, Qunzhong Wang, Haoyang Wu, Danqi Chen, and Zhuang Liu, demonstrated that Vero improves over four base models by 3.7 to 5.5 points on average across VeroEval, their comprehensive 30-benchmark suite.
Notably, when starting from Qwen3-VL-8B-Instruct, Vero outperforms Qwen3-VL-8B-Thinking on 23 of 30 benchmarks without requiring additional proprietary thinking data. This achievement challenges the assumption that proprietary training approaches hold significant advantages over open methods.
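The headline comparisons above reduce to simple arithmetic over per-benchmark scores: an average score delta across the suite, and a count of benchmarks where the tuned model wins. A minimal sketch, using made-up placeholder scores rather than the paper's actual VeroEval numbers:

```python
# Hypothetical per-benchmark scores for illustration only; the real
# VeroEval results are reported in the paper, not reproduced here.
base_scores = [62.0, 55.0, 70.0]
vero_scores = [66.5, 58.0, 74.0]

deltas = [v - b for v, b in zip(vero_scores, base_scores)]
avg_gain = sum(deltas) / len(deltas)   # the "+X points on average" figure
wins = sum(d > 0 for d in deltas)      # the "outperforms on N of M" figure
print(f"avg gain: {avg_gain:.1f} points, wins: {wins}/{len(deltas)}")
```

The same two statistics, computed over 30 benchmarks instead of 3, yield the paper's reported 3.7-5.5 point average gains and the 23-of-30 win count against Qwen3-VL-8B-Thinking.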
Vero-600K Dataset Covers Six Broad Task Categories
The Vero-600K dataset includes 600,000 reinforcement learning training samples constructed from 59 different datasets. The data covers six broad task categories:
- Charts and data visualization
- Scientific reasoning
- Spatial understanding
- Open-ended visual tasks
- Mathematical reasoning from images
- General visual question answering
The researchers designed task-routed rewards to handle heterogeneous answer formats across these diverse categories. When trained from the same base model, models trained on Vero-600K exceed those trained on existing RL datasets across all task categories, demonstrating the effectiveness of broad data coverage.
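Task-routed rewards of this kind can be sketched as a dispatch table that maps each task category to a scoring function suited to its answer format. This is an illustrative sketch only; the function names, categories, and tolerances below are assumptions, not the paper's actual implementation:

```python
# Illustrative task-routed reward dispatch; categories and scoring rules
# are hypothetical, not Vero's actual reward design.

def exact_match_reward(pred: str, gold: str) -> float:
    # Multiple-choice / short-answer VQA: strict normalized string match.
    return 1.0 if pred.strip().lower() == gold.strip().lower() else 0.0

def numeric_reward(pred: str, gold: str, rel_tol: float = 0.01) -> float:
    # Math and chart questions: accept answers within a relative tolerance.
    try:
        p, g = float(pred), float(gold)
    except ValueError:
        return 0.0
    if g == 0:
        return 1.0 if p == 0 else 0.0
    return 1.0 if abs(p - g) / abs(g) <= rel_tol else 0.0

def token_f1_reward(pred: str, gold: str) -> float:
    # Open-ended answers: token-level F1 as a soft, graded reward.
    p, g = pred.lower().split(), gold.lower().split()
    common = len(set(p) & set(g))
    if not p or not g or common == 0:
        return 0.0
    prec, rec = common / len(p), common / len(g)
    return 2 * prec * rec / (prec + rec)

REWARD_ROUTER = {
    "charts": numeric_reward,
    "math": numeric_reward,
    "science": exact_match_reward,
    "spatial": exact_match_reward,
    "vqa": exact_match_reward,
    "open_ended": token_f1_reward,
}

def route_reward(task_category: str, pred: str, gold: str) -> float:
    # Dispatch to the reward function registered for the sample's category.
    return REWARD_ROUTER[task_category](pred, gold)
```

The design point is that a single RL loop can consume heterogeneous data as long as each sample carries a category tag that selects the right verifier, rather than forcing every task into one answer format.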
Research Reveals Broad Data Coverage Drives RL Scaling
Systematic ablations conducted by the research team revealed that different task categories elicit qualitatively distinct reasoning patterns. Critically, these patterns transfer poorly in isolation, suggesting that broad data coverage rather than specialized reasoning templates drives strong reinforcement learning scaling.
This finding has significant implications for future visual reasoning systems. Rather than developing specialized reasoning approaches for specific task types, the research suggests that comprehensive coverage across diverse visual reasoning tasks produces more robust and generalizable models.
The full open release of data, code, and models enables the research community to validate these findings and build upon the work without recreating the entire training pipeline from scratch.
Key Takeaways
- Princeton and Meta researchers released Vero on arXiv (paper 2604.04917) on April 6, 2026, with complete data, code, and models
- Vero improves over four base models by 3.7-5.5 points on average across a 30-benchmark evaluation suite
- Starting from Qwen3-VL-8B-Instruct, Vero outperforms Qwen3-VL-8B-Thinking on 23 of 30 benchmarks without proprietary thinking data
- The Vero-600K dataset includes 600,000 RL training samples from 59 datasets covering six broad task categories
- Research findings suggest broad data coverage, not specialized reasoning templates, is the primary driver of strong RL scaling in visual reasoning