Researchers from multiple institutions have introduced DataMaster, an autonomous data engineering framework that achieved a 32.27% improvement in medal rate on MLE-Bench Lite by optimizing the data side of machine learning pipelines while leaving the learning algorithms unchanged. The paper, published on arXiv on May 11, 2026 by Yaxin Du, Xiyuan Yang, and colleagues, addresses task-conditioned autonomous data engineering, in which an agent discovers, selects, cleans, and transforms data to improve downstream model performance.
The framework shifts focus from model architecture and training recipes to data optimization, reflecting the reality that standardized model families and compute budgets make data engineering the next frontier for ML system improvements.
DataMaster Architecture Combines Three Core Components
DataMaster integrates three key components to handle the open-ended search space and delayed validation inherent in autonomous data engineering. DataTree organizes alternative data-engineering branches in a tree structure, allowing systematic exploration of data preparation strategies. The Data Pool stores discovered external data sources for reuse across branches, reducing redundant data discovery. Global Memory records node outcomes, artifacts, and reusable findings to carry evidence across branches and inform future decisions.
These components enable the agent to discover candidate data sources, construct executable training inputs, evaluate them through downstream feedback, and refine approaches based on cumulative learning.
Benchmark Results Show Significant Performance Gains
On MLE-Bench Lite, DataMaster improved the medal rate by 32.27% over the baseline. On PostTrainBench, the framework surpassed the instruct model on GPQA, scoring 31.02% versus 30.35%. The researchers note that as model families, training recipes, and compute budgets become standardized, further gains in machine learning systems depend increasingly on data, yet data engineering remains largely manual and ad hoc.
The framework addresses this gap by automating data discovery, selection, composition, cleaning, and transformation while maintaining compatibility with fixed learning algorithms.
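The five automated stages can be illustrated as a toy pipeline over in-memory records. This is a minimal sketch under assumed semantics; the stage functions (`discover`, `select`, `compose`, `clean`, `transform`) and the record format are invented for illustration and are not the framework's API.

```python
# Illustrative toy pipeline for the five data-engineering stages the
# article names. All function names and data are assumptions.

def discover() -> list[dict]:
    """Stage 1: surface candidate records from an external source."""
    return [
        {"text": "Cat", "label": 1},
        {"text": "Cat", "label": 1},   # exact duplicate
        {"text": "", "label": 0},      # empty record
        {"text": "Dog", "label": 0},
    ]

def select(records: list[dict]) -> list[dict]:
    """Stage 2: keep records that pass a task-conditioned filter."""
    return [r for r in records if r["label"] in (0, 1)]

def compose(records: list[dict], pool: list[dict]) -> list[dict]:
    """Stage 3: merge newly discovered records with an existing pool."""
    return records + pool

def clean(records: list[dict]) -> list[dict]:
    """Stage 4: drop empty records and exact duplicates."""
    seen, out = set(), []
    for r in records:
        key = (r["text"], r["label"])
        if r["text"] and key not in seen:
            seen.add(key)
            out.append(r)
    return out

def transform(records: list[dict]) -> list[dict]:
    """Stage 5: map cleaned records into model-ready training inputs."""
    return [{"input": r["text"].lower(), "target": r["label"]} for r in records]

# The downstream learning algorithm is untouched; only its inputs change.
training_inputs = transform(clean(compose(select(discover()), [])))
```

The key point the sketch makes concrete is compatibility: every stage maps records to records, so any fixed training procedure can consume `training_inputs` unmodified.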
Key Takeaways
- DataMaster achieved a 32.27% improvement in medal rate on MLE-Bench Lite through autonomous data engineering
- The framework uses three components: DataTree for branch organization, Data Pool for shared data sources, and Global Memory for cumulative learning
- On PostTrainBench's GPQA benchmark, DataMaster scored 31.02% compared to 30.35% for the baseline instruct model
- The system addresses the open-ended search space and delayed validation challenges through tree-structured search and shared candidate data
- Published on arXiv on May 11, 2026, by researchers including Yaxin Du and Xiyuan Yang from multiple institutions