Researchers Indraneil Paul, Goran Glavaš, and Iryna Gurevych published "Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring" on arXiv on May 1, 2026, introducing the largest open-source collection of code preferences to date, with more than 350,000 preference pairs. The paper addresses a critical limitation of existing code reward models: they focus almost exclusively on execution feedback and functional correctness, missing dimensions such as code quality, readability, efficiency, and style.
Three Major Contributions Address Code Evaluation Limitations
The research provides three core contributions:
- Themis-CodeRewardBench: A benchmark evaluating code reward models across five preference dimensions and eight programming languages, profiling 50+ existing code, math, and general-purpose reward models
- Themis-CodePreference: A dataset of more than 350,000 preference pairs that is diverse across languages, criteria, and quality dimensions
- Themis-RM Suite: Multilingual code reward models in sizes ranging from 600M to 32B parameters with flexible multi-criteria scoring capability
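To illustrate what a multi-criteria preference pair contains, here is a hypothetical record in Python; the field names and values are illustrative assumptions, not the dataset's actual schema:

```python
# Hypothetical record showing what a multi-criteria code preference pair
# could contain; the schema is an illustration, not taken from the paper.
preference_pair = {
    "prompt": "Write a function that returns the n-th Fibonacci number.",
    "language": "python",          # one of the eight covered languages
    "criterion": "efficiency",     # e.g. correctness, style, readability...
    "chosen": (
        "def fib(n):\n"
        "    a, b = 0, 1\n"
        "    for _ in range(n):\n"
        "        a, b = b, a + b\n"
        "    return a\n"
    ),
    "rejected": (
        "def fib(n):\n"
        "    if n < 2:\n"
        "        return n\n"
        "    return fib(n - 1) + fib(n - 2)\n"  # exponential-time version
    ),
}

print(sorted(preference_pair))
```

Both completions are functionally correct here; the pair encodes a preference along a single non-correctness criterion, which is what distinguishes this data from execution-only preference sets.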
Existing Code Reward Models Show Limited Proficiency Beyond Functional Correctness
The Themis-CodeRewardBench benchmark revealed "limited proficiency of current RMs beyond scoring for functional correctness." Most existing code reward models reduce evaluation to a binary pass/fail signal that fails to capture important quality dimensions. The research found this approach particularly problematic for production code, where style, efficiency, readability, maintainability, and documentation matter as much as whether the code executes correctly.
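To make the contrast concrete, here is a minimal Python sketch (an illustration, not the paper's interface) of how a per-dimension score differs from a single pass/fail bit, with a flexible weighted aggregate in the spirit of multi-criteria scoring:

```python
from dataclasses import dataclass

@dataclass
class MultiCriteriaScore:
    """Per-dimension scores in [0, 1]. Dimension names mirror those the
    article lists; the interface itself is a hypothetical sketch."""
    correctness: float
    style: float
    efficiency: float
    readability: float
    documentation: float

    def overall(self, weights=None):
        """Weighted aggregate of all dimensions; equal weights by default."""
        dims = ["correctness", "style", "efficiency",
                "readability", "documentation"]
        weights = weights or {d: 1.0 for d in dims}
        total = sum(weights.values())
        return sum(weights[d] * getattr(self, d) for d in dims) / total

# A binary reward collapses everything into one bit:
binary_reward = 1  # "tests passed"

# A multi-criteria reward model keeps the dimensions separate, so the
# same candidate can score high on correctness but low on documentation:
score = MultiCriteriaScore(correctness=1.0, style=0.4, efficiency=0.7,
                           readability=0.5, documentation=0.2)
print(round(score.overall(), 2))  # prints 0.56
```

The weighted aggregate is what makes the scoring "flexible": a downstream user can upweight, say, efficiency without retraining anything.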
Larger Models Show Positive Scaling and Strong Cross-Lingual Transfer
Key findings from training the Themis-RM suite include:
- Positive scaling trends: Larger models from 600M to 32B parameters demonstrate better performance across evaluation criteria
- Strong cross-lingual transfer: Models trained on diverse multilingual preferences generalize well across languages without requiring language-specific models
- Multi-criteria training importance: Training on diverse quality dimensions proves crucial for reliable code reward modeling
350K Preference Pairs Enable Nuanced Code Quality Evaluation
The Themis-CodePreference dataset is a major open-source contribution that enables training for nuanced code evaluation beyond simple execution tests. Code reward models are becoming critical for post-training language models via reinforcement learning from human feedback (RLHF) and reinforcement learning from AI feedback (RLAIF), but current models are narrow in scope and often limited to a single language. Themis enables evaluation across style, efficiency, readability, maintainability, and documentation in eight programming languages.
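The standard way such preference pairs supervise a reward model is a Bradley-Terry pairwise loss; the sketch below assumes that formulation (the paper's exact training objective is not quoted here):

```python
import math

def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood that the chosen response outranks the
    rejected one under a Bradley-Terry model: -log(sigmoid(r_c - r_r)).
    Written as log1p(exp(-margin)) for numerical stability."""
    margin = r_chosen - r_rejected
    return math.log1p(math.exp(-margin))

# The loss shrinks as the reward model separates the pair more cleanly,
# and grows when the model ranks the rejected response higher.
print(round(bradley_terry_loss(2.0, 0.5), 4))  # prints 0.2014 (well separated)
print(round(bradley_terry_loss(0.5, 2.0), 4))  # prints 1.7014 (mis-ranked)
```

Minimizing this loss over hundreds of thousands of pairs pushes the reward model to assign higher scores to preferred code along each annotated criterion.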
Key Takeaways
- Researchers published Themis with more than 350,000 code preference pairs on May 1, 2026, the largest open-source code preference collection to date
- The Themis-RM suite includes multilingual code reward models ranging from 600M to 32B parameters for multi-criteria evaluation across eight programming languages
- Existing code reward models show limited proficiency beyond functional correctness, missing quality dimensions like style, efficiency, and readability
- Larger models demonstrate positive scaling trends and strong cross-lingual transfer when trained on diverse multilingual preferences
- Multi-criteria training enables more nuanced code evaluation essential for production systems beyond simple pass/fail execution tests