Researchers Indraneil Paul, Goran Glavaš, and Iryna Gurevych published "Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring" on arXiv on May 1, 2026, introducing the largest open-source collection of code preferences to date, with more than 350,000 preference pairs. The paper addresses a critical limitation of existing code reward models: they focus almost exclusively on execution feedback and functional correctness, missing dimensions such as code quality, readability, efficiency, and style.
Three Major Contributions Address Code Evaluation Limitations
The research provides three core contributions:
- Themis-CodeRewardBench: A benchmark evaluating code reward models across five preference dimensions and eight programming languages, profiling 50+ existing code, math, and general-purpose reward models
- Themis-CodePreference: A dataset of more than 350,000 preference pairs that is diverse across languages, criteria, and quality dimensions
- Themis-RM Suite: Multilingual code reward models in sizes ranging from 600M to 32B parameters with flexible multi-criteria scoring capability
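To illustrate what a multi-criteria preference pair contains, here is a hypothetical record in Python; the field names and values are illustrative assumptions, not the dataset's actual schema:

```python
# Hypothetical record showing what a multi-criteria code preference pair
# could contain; the schema is an illustration, not taken from the paper.
preference_pair = {
    "prompt": "Write a function that returns the n-th Fibonacci number.",
    "language": "python",          # one of the eight covered languages
    "criterion": "efficiency",     # e.g. correctness, style, readability...
    "chosen": (
        "def fib(n):\n"
        "    a, b = 0, 1\n"
        "    for _ in range(n):\n"
        "        a, b = b, a + b\n"
        "    return a\n"
    ),
    "rejected": (
        "def fib(n):\n"
        "    if n < 2:\n"
        "        return n\n"
        "    return fib(n - 1) + fib(n - 2)\n"  # exponential-time version
    ),
}

print(sorted(preference_pair))
```

Both completions are functionally correct here; the pair encodes a preference along a single non-correctness criterion, which is what distinguishes this data from execution-only preference sets.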
Existing Code Reward Models Show Limited Proficiency Beyond Functional Correctness
The Themis-CodeRewardBench benchmark revealed "limited proficiency of current RMs beyond scoring for functional correctness." Most existing code reward models reduce evaluation to a binary pass/fail signal that fails to capture important quality dimensions. The research found this approach particularly problematic for production code, where style, efficiency, readability, maintainability, and documentation matter as much as whether the code executes correctly.
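To make the contrast concrete, here is a minimal Python sketch (an illustration, not the paper's interface) of how a per-dimension score differs from a single pass/fail bit, with a flexible weighted aggregate in the spirit of multi-criteria scoring:

```python
from dataclasses import dataclass

@dataclass
class MultiCriteriaScore:
    """Per-dimension scores in [0, 1]. Dimension names mirror those the
    article lists; the interface itself is a hypothetical sketch."""
    correctness: float
    style: float
    efficiency: float
    readability: float
    documentation: float

    def overall(self, weights=None):
        """Weighted aggregate of all dimensions; equal weights by default."""
        dims = ["correctness", "style", "efficiency",
                "readability", "documentation"]
        weights = weights or {d: 1.0 for d in dims}
        total = sum(weights.values())
        return sum(weights[d] * getattr(self, d) for d in dims) / total

# A binary reward collapses everything into one bit:
binary_reward = 1  # "tests passed"

# A multi-criteria reward model keeps the dimensions separate, so the
# same candidate can score high on correctness but low on documentation:
score = MultiCriteriaScore(correctness=1.0, style=0.4, efficiency=0.7,
                           readability=0.5, documentation=0.2)
print(round(score.overall(), 2))  # prints 0.56
```

The weighted aggregate is what makes the scoring "flexible": a downstream user can upweight, say, efficiency without retraining anything.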
Larger Models Show Positive Scaling and Strong Cross-Lingual Transfer
Key findings from training the Themis-RM suite include:
- Positive scaling trends: Larger models from 600M to 32B parameters demonstrate better performance across evaluation criteria
- Strong cross-lingual transfer: Models trained on diverse multilingual preferences generalize well across languages without requiring language-specific models
- Multi-criteria training importance: Training on diverse quality dimensions proves crucial for reliable code reward modeling
350K Preference Pairs Enable Nuanced Code Quality Evaluation
The Themis-CodePreference dataset is a major open-source contribution that enables training for nuanced code evaluation beyond simple execution tests. Code reward models are becoming critical for post-training language models via reinforcement learning from human feedback (RLHF) and reinforcement learning from AI feedback (RLAIF), but current models are narrow in scope and often limited to a single language. Themis enables evaluation across style, efficiency, readability, maintainability, and documentation in eight programming languages.
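The standard way such preference pairs supervise a reward model is a Bradley-Terry pairwise loss; the sketch below assumes that formulation (the paper's exact training objective is not quoted here):

```python
import math

def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood that the chosen response outranks the
    rejected one under a Bradley-Terry model: -log(sigmoid(r_c - r_r)).
    Written as log1p(exp(-margin)) for numerical stability."""
    margin = r_chosen - r_rejected
    return math.log1p(math.exp(-margin))

# The loss shrinks as the reward model separates the pair more cleanly,
# and grows when the model ranks the rejected response higher.
print(round(bradley_terry_loss(2.0, 0.5), 4))  # prints 0.2014 (well separated)
print(round(bradley_terry_loss(0.5, 2.0), 4))  # prints 1.7014 (mis-ranked)
```

Minimizing this loss over hundreds of thousands of pairs pushes the reward model to assign higher scores to preferred code along each annotated criterion.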
Key Takeaways
- Researchers published Themis with more than 350,000 code preference pairs on May 1, 2026, the largest open-source code preference collection to date
- The Themis-RM suite includes multilingual code reward models ranging from 600M to 32B parameters for multi-criteria evaluation across eight programming languages
- Existing code reward models show limited proficiency beyond functional correctness, missing quality dimensions like style, efficiency, and readability
- Larger models demonstrate positive scaling trends and strong cross-lingual transfer when trained on diverse multilingual preferences
- Multi-criteria training enables more nuanced code evaluation essential for production systems beyond simple pass/fail execution tests