Researchers from multiple institutions have introduced Sample-Routed Policy Optimization (SRPO), a new reinforcement learning framework that resolves a fundamental tension in LLM post-training by combining the strengths of two competing approaches. Published on arXiv on April 2, 2026, the paper targets a well-known trade-off: Group Relative Policy Optimization (GRPO) provides stable long-horizon training but lacks token-level precision, while Self-Distillation Policy Optimization (SDPO) enables rapid early improvement but frequently collapses during extended training.
SRPO Routes Samples Based on Success to Optimize Credit Assignment
The core innovation of SRPO is its sample routing mechanism: correct samples are directed to GRPO's reward-aligned reinforcement learning, while failed samples receive SDPO's targeted logit-level correction. This routing strategy addresses the two intrinsic flaws in SDPO that cause late-stage instability: self-distillation on already-correct samples creates optimization ambiguity, and the self-teacher's signal reliability progressively degrades over training.
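The paper is summarized here without code, but the routing rule itself is simple to sketch. The following is a minimal illustration, not the authors' implementation: the function names, the batch layout, and the use of a binary reward as the correctness signal are all assumptions.

```python
# Hypothetical sketch of SRPO-style sample routing: correct rollouts go to
# the GRPO (reward-aligned RL) branch, failed rollouts to the SDPO
# (logit-level self-distillation) branch.

def route_samples(samples):
    """Split a rollout batch by correctness of the final answer."""
    grpo_batch, sdpo_batch = [], []
    for s in samples:
        # Assumption: reward > 0 means the sample's answer was correct.
        (grpo_batch if s["reward"] > 0 else sdpo_batch).append(s)
    return grpo_batch, sdpo_batch

samples = [
    {"id": 0, "reward": 1.0},  # correct -> GRPO update
    {"id": 1, "reward": 0.0},  # failed  -> SDPO correction
    {"id": 2, "reward": 1.0},  # correct -> GRPO update
]
grpo_batch, sdpo_batch = route_samples(samples)
```

Each branch would then apply its own loss to its sub-batch, so no sample is ever optimized under both objectives at once.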
SRPO incorporates an entropy-aware dynamic weighting mechanism that suppresses high-entropy, unreliable distillation targets while emphasizing confident predictions. This allows the framework to maintain both the rapid initial learning characteristic of SDPO and the long-horizon stability of GRPO.
Benchmark Results Show Consistent Improvements Across Model Scales
Evaluated across five benchmarks and two model scales, SRPO achieved measurable performance gains:
- Raised the five-benchmark average on Qwen3-8B by 3.4% over GRPO and 6.3% over SDPO
- Consistently surpassed peak performance of both baseline methods
- Reduced per-step compute cost by up to 17.2%
- Produced moderate response lengths while maintaining quality
The paper's authors—Gengsheng Li, Tianyu Yang, Junfeng Fang, and their team—demonstrated that SRPO provides a unified framework that achieves fast initial learning without sacrificing long-term optimization stability.
Key Takeaways
- SRPO routes correct samples to GRPO reinforcement learning and failed samples to SDPO logit-level correction, combining strengths of both approaches
- The framework addresses SDPO's late-stage collapse through entropy-aware dynamic weighting that filters unreliable distillation targets
- On Qwen3-8B, SRPO improved five-benchmark average performance by 3.4% over GRPO and 6.3% over SDPO
- SRPO reduces per-step compute cost by up to 17.2% compared to baseline methods
- The unified framework matters for production LLM training where both sample efficiency and reliability are critical requirements