Researchers have released SandMLE, a multi-agent framework that tackles a critical bottleneck in training machine learning engineering (MLE) agents, cutting verification time by more than a factor of 13. Published on arXiv on April 6, 2026, the framework enables large-scale, on-policy reinforcement learning for MLE tasks, which was previously impractical due to computational costs.
Traditional MLE Training Faces Verification Bottleneck
Unlike software engineering agents, which can verify code with fast unit tests, MLE agents require a full ML pipeline execution at each rollout step: data preprocessing, model training, and metric evaluation on large datasets. Each verification step is therefore orders of magnitude more expensive, forcing existing approaches to fall back on supervised fine-tuning or offline proxy rewards and sacrificing the exploration benefits of on-policy RL.
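To make the cost asymmetry concrete, here is a minimal sketch of what a single MLE verification step entails: the whole pipeline (preprocess, train, evaluate) must run to completion before any reward signal exists. All names here (`run_pipeline`, the toy threshold model) are illustrative stand-ins, not SandMLE's actual API.

```python
import random

def run_pipeline(dataset, epochs=50):
    """Toy stand-in for one MLE verification step: the entire
    pipeline must execute before the RL loop sees a reward."""
    # preprocess: normalize the 1-D feature to [0, 1]
    xs = [x for x, _ in dataset]
    lo, hi = min(xs), max(xs)
    data = [((x - lo) / (hi - lo), y) for x, y in dataset]

    # train: fit a decision threshold by grid search
    best_acc = 0.0
    for step in range(epochs):
        t = step / epochs
        acc = sum((x > t) == y for x, y in data) / len(data)
        best_acc = max(best_acc, acc)

    # evaluate: the accuracy is the reward the agent finally receives
    return best_acc

random.seed(0)
# labels follow a clean threshold rule, so high accuracy is reachable
dataset = [(x, x > 40) for x in (random.uniform(0, 100) for _ in range(200))]
reward = run_pipeline(dataset)
print(round(reward, 2))
```

On a real task, every stage above scales with dataset size and model size, which is exactly why per-rollout verification dominates training cost.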
SandMLE Generates Micro-Scale Synthetic Environments
The framework generates diverse, verifiable synthetic MLE environments from a small number of seed tasks while preserving the structural and technical complexity of real-world problems. The key innovation is constraining each task's dataset to micro-scale, only 50 to 200 training samples. The underlying observation, that sandbox data size rather than computational complexity is the primary bottleneck, makes on-policy RL training practical while maintaining task fidelity.
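A minimal sketch of the micro-scale idea, assuming a simple label-stratified subsampling scheme: cap the training set at 50-200 samples while keeping the label balance, so the pipeline's structure is unchanged but each verification run touches far less data. The function name and strategy here are our illustration, not the paper's implementation.

```python
import random
from collections import Counter

def to_micro_scale(dataset, target_size=150, seed=0):
    """Subsample a (feature, label) dataset down to roughly
    target_size examples, balanced across labels."""
    rng = random.Random(seed)
    by_label = {}
    for x, y in dataset:
        by_label.setdefault(y, []).append(x)
    per_label = target_size // len(by_label)
    micro = []
    for y, xs in by_label.items():
        micro.extend((x, y) for x in rng.sample(xs, min(per_label, len(xs))))
    rng.shuffle(micro)
    return micro

full = [(i, i % 3) for i in range(30_000)]   # 30k-sample "real" task
micro = to_micro_scale(full, target_size=150)
print(len(micro), Counter(y for _, y in micro))
```

Because training and evaluation cost scale with sample count, shrinking each task's data by two orders of magnitude is what makes per-rollout verification, and hence on-policy RL, affordable.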
Performance Improvements Across Model Sizes
Tested on MLE-bench-lite across Qwen3-8B, 14B, and 30B-A3B models, SandMLE achieved relative medal rate improvements ranging from 20.3% to 66.9% over supervised fine-tuning baselines. The trained policy generalizes across unseen agentic scaffolds and achieves up to 32.4% better HumanRank scores on MLE-Dojo. Authors Yuhang Zhou, Lizhu Zhang, Yifan Wu, Jiayi Liu, Xiangjun Fan, Zhuokai Zhao, and Hong Yan note that the framework enables exploration-based learning previously impossible in the MLE domain.
Key Takeaways
- SandMLE reduces MLE agent verification time by over 13x through micro-scale synthetic datasets of 50-200 samples
- Achieves 20.3% to 66.9% relative medal rate improvements over supervised fine-tuning baselines on MLE-bench-lite
- Enables first-ever large-scale on-policy trajectory-wise RL for machine learning engineering tasks
- Trained policies generalize across unseen agentic scaffolds with up to 32.4% better HumanRank scores on MLE-Dojo
- Marks a step in the evolution from software engineering agents toward ML engineering agents that design and train full pipelines