Researchers from Princeton and Meta have released Vero, a family of fully open vision-language models that match or exceed existing open-weight models across diverse visual reasoning tasks. Published on arXiv on April 6, 2026, the research addresses a critical gap in AI transparency by releasing a complete reinforcement learning pipeline of the kind that proprietary labs have kept locked away.
The release includes all training data, code, and models, enabling reproducible research on visual reasoning systems. The 600,000-sample dataset, constructed from 59 different datasets, reflects substantial work in data curation and in reward design across heterogeneous tasks.
Vero Improves Base Models by 3.7-5.5 Points on Average
The research team, led by Gabriel Sarch, Linrong Cai, Qunzhong Wang, Haoyang Wu, Danqi Chen, and Zhuang Liu, demonstrated that Vero improves over four base models by 3.7 to 5.5 points on average across VeroEval, their comprehensive 30-benchmark suite.
Notably, when starting from Qwen3-VL-8B-Instruct, Vero outperforms Qwen3-VL-8B-Thinking on 23 of 30 benchmarks without requiring additional proprietary thinking data. This achievement challenges the assumption that proprietary training approaches hold significant advantages over open methods.
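The headline comparisons above reduce to simple arithmetic over per-benchmark scores: an average score delta across the suite, and a count of benchmarks where the tuned model wins. A minimal sketch, using made-up placeholder scores rather than the paper's actual VeroEval numbers:

```python
# Hypothetical per-benchmark scores for illustration only; the real
# VeroEval results are reported in the paper, not reproduced here.
base_scores = [62.0, 55.0, 70.0]
vero_scores = [66.5, 58.0, 74.0]

deltas = [v - b for v, b in zip(vero_scores, base_scores)]
avg_gain = sum(deltas) / len(deltas)   # the "+X points on average" figure
wins = sum(d > 0 for d in deltas)      # the "outperforms on N of M" figure
print(f"avg gain: {avg_gain:.1f} points, wins: {wins}/{len(deltas)}")
```

The same two statistics, computed over 30 benchmarks instead of 3, yield the paper's reported 3.7-5.5 point average gains and the 23-of-30 win count against Qwen3-VL-8B-Thinking.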
Vero-600K Dataset Covers Six Broad Task Categories
The Vero-600K dataset includes 600,000 reinforcement learning training samples constructed from 59 different datasets. The data covers six broad task categories:
- Charts and data visualization
- Scientific reasoning
- Spatial understanding
- Open-ended visual tasks
- Mathematical reasoning from images
- General visual question answering
The researchers designed task-routed rewards to handle heterogeneous answer formats across these diverse categories. When trained from the same base model, models trained on Vero-600K exceed those trained on existing RL datasets across all task categories, demonstrating the effectiveness of broad data coverage.
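Task-routed rewards of this kind can be sketched as a dispatch table that maps each task category to a scoring function suited to its answer format. This is an illustrative sketch only; the function names, categories, and tolerances below are assumptions, not the paper's actual implementation:

```python
# Illustrative task-routed reward dispatch; categories and scoring rules
# are hypothetical, not Vero's actual reward design.

def exact_match_reward(pred: str, gold: str) -> float:
    # Multiple-choice / short-answer VQA: strict normalized string match.
    return 1.0 if pred.strip().lower() == gold.strip().lower() else 0.0

def numeric_reward(pred: str, gold: str, rel_tol: float = 0.01) -> float:
    # Math and chart questions: accept answers within a relative tolerance.
    try:
        p, g = float(pred), float(gold)
    except ValueError:
        return 0.0
    if g == 0:
        return 1.0 if p == 0 else 0.0
    return 1.0 if abs(p - g) / abs(g) <= rel_tol else 0.0

def token_f1_reward(pred: str, gold: str) -> float:
    # Open-ended answers: token-level F1 as a soft, graded reward.
    p, g = pred.lower().split(), gold.lower().split()
    common = len(set(p) & set(g))
    if not p or not g or common == 0:
        return 0.0
    prec, rec = common / len(p), common / len(g)
    return 2 * prec * rec / (prec + rec)

REWARD_ROUTER = {
    "charts": numeric_reward,
    "math": numeric_reward,
    "science": exact_match_reward,
    "spatial": exact_match_reward,
    "vqa": exact_match_reward,
    "open_ended": token_f1_reward,
}

def route_reward(task_category: str, pred: str, gold: str) -> float:
    # Dispatch to the reward function registered for the sample's category.
    return REWARD_ROUTER[task_category](pred, gold)
```

The design point is that a single RL loop can consume heterogeneous data as long as each sample carries a category tag that selects the right verifier, rather than forcing every task into one answer format.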
Research Reveals Broad Data Coverage Drives RL Scaling
Systematic ablations conducted by the research team revealed that different task categories elicit qualitatively distinct reasoning patterns. Critically, these patterns transfer poorly in isolation, suggesting that broad data coverage rather than specialized reasoning templates drives strong reinforcement learning scaling.
This finding has significant implications for future visual reasoning systems. Rather than developing specialized reasoning approaches for specific task types, the research suggests that comprehensive coverage across diverse visual reasoning tasks produces more robust and generalizable models.
The full open release of data, code, and models enables the research community to validate these findings and build upon the work without recreating the entire training pipeline from scratch.
Key Takeaways
- Princeton and Meta researchers released Vero on arXiv (paper 2604.04917) on April 6, 2026, with complete data, code, and models
- Vero improves over four base models by 3.7-5.5 points on average across a 30-benchmark evaluation suite
- Starting from Qwen3-VL-8B-Instruct, Vero outperforms Qwen3-VL-8B-Thinking on 23 of 30 benchmarks without proprietary thinking data
- The Vero-600K dataset includes 600,000 RL training samples from 59 datasets covering six broad task categories
- Research findings suggest broad data coverage, not specialized reasoning templates, is the primary driver of strong RL scaling in visual reasoning