Researchers from multiple institutions have introduced Sample-Routed Policy Optimization (SRPO), a new reinforcement learning framework that resolves a fundamental tension in LLM post-training by combining the strengths of two competing approaches. Published on arXiv on April 2, 2026, the paper targets a well-known trade-off: Group Relative Policy Optimization (GRPO) provides stable long-horizon training but lacks token-level precision, while Self-Distillation Policy Optimization (SDPO) enables rapid early improvement but frequently collapses during extended training.
SRPO Routes Samples Based on Success to Optimize Credit Assignment
The core innovation of SRPO is its sample routing mechanism: correct samples are directed to GRPO's reward-aligned reinforcement learning, while failed samples receive SDPO's targeted logit-level correction. This routing strategy addresses the two intrinsic flaws in SDPO that cause late-stage instability: self-distillation on already-correct samples creates optimization ambiguity, and the self-teacher's signal reliability progressively degrades over training.
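The paper is summarized here without code, but the routing rule itself is simple to sketch. The following is a minimal illustration, not the authors' implementation: the function names, the batch layout, and the use of a binary reward as the correctness signal are all assumptions.

```python
# Hypothetical sketch of SRPO-style sample routing: correct rollouts go to
# the GRPO (reward-aligned RL) branch, failed rollouts to the SDPO
# (logit-level self-distillation) branch.

def route_samples(samples):
    """Split a rollout batch by correctness of the final answer."""
    grpo_batch, sdpo_batch = [], []
    for s in samples:
        # Assumption: reward > 0 means the sample's answer was correct.
        (grpo_batch if s["reward"] > 0 else sdpo_batch).append(s)
    return grpo_batch, sdpo_batch

samples = [
    {"id": 0, "reward": 1.0},  # correct -> GRPO update
    {"id": 1, "reward": 0.0},  # failed  -> SDPO correction
    {"id": 2, "reward": 1.0},  # correct -> GRPO update
]
grpo_batch, sdpo_batch = route_samples(samples)
```

Each branch would then apply its own loss to its sub-batch, so no sample is ever optimized under both objectives at once.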
SRPO incorporates an entropy-aware dynamic weighting mechanism that suppresses high-entropy, unreliable distillation targets while emphasizing confident predictions. This allows the framework to maintain both the rapid initial learning characteristic of SDPO and the long-horizon stability of GRPO.
Benchmark Results Show Consistent Improvements Across Model Scales
Evaluated across five benchmarks and two model scales, SRPO achieved measurable performance gains:
- Raised the five-benchmark average on Qwen3-8B by 3.4% over GRPO and 6.3% over SDPO
- Consistently surpassed peak performance of both baseline methods
- Reduced per-step compute cost by up to 17.2%
- Produced moderate response lengths while maintaining quality
The paper's authors—Gengsheng Li, Tianyu Yang, Junfeng Fang, and their team—demonstrated that SRPO provides a unified framework that achieves fast initial learning without sacrificing long-term optimization stability.
Key Takeaways
- SRPO routes correct samples to GRPO reinforcement learning and failed samples to SDPO logit-level correction, combining strengths of both approaches
- The framework addresses SDPO's late-stage collapse through entropy-aware dynamic weighting that filters unreliable distillation targets
- On Qwen3-8B, SRPO improved five-benchmark average performance by 3.4% over GRPO and 6.3% over SDPO
- SRPO reduces per-step compute cost by up to 17.2% compared to baseline methods
- The unified framework matters for production LLM training where both sample efficiency and reliability are critical requirements