SC-GRPO uses per-token KL divergence from self-conditioned trajectories to weight gradients in reinforcement learning. It outperforms GRPO by 8.1% and DAPO by 5.9% across math, code, and agentic tasks, with superior out-of-distribution performance and better results than OPD.
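To make the idea of KL-based token weighting concrete, here is a minimal sketch in plain Python. It is not the paper's formulation: the summary does not say how the per-token KL is turned into a weight, so the mean-one normalization, the direction of the KL, and every function name below (`token_kl`, `kl_token_weights`, `group_relative_advantages`) are assumptions chosen for illustration. The only parts taken from known practice are the softmax, the KL divergence itself, and the GRPO-style group normalization of rewards.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's vocabulary logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_kl(p, q, eps=1e-12):
    """KL(p || q) between two next-token distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def kl_token_weights(cond_logits, base_logits):
    """Per-token weights from the KL between a self-conditioned pass and a
    plain pass over the same tokens.

    Assumption: weights are the per-token KL rescaled to mean 1, so tokens
    where the two passes disagree most receive the most credit.
    """
    kls = [token_kl(softmax(c), softmax(b))
           for c, b in zip(cond_logits, base_logits)]
    mean_kl = sum(kls) / len(kls)
    if mean_kl == 0.0:
        return [1.0] * len(kls)  # no signal: fall back to uniform weights
    return [k / mean_kl for k in kls]


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: reward standardized within a sampled group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


# Toy example: one 2-token response over a 3-word vocabulary.
cond = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # hypothetical self-conditioned logits
base = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # hypothetical unconditioned logits
weights = kl_token_weights(cond, base)

# Verifiable rewards for a group of four sampled responses.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Token-level advantages for the first response in the group.
token_advantages = [w * advantages[0] for w in weights]
```

In this toy case the second token has identical distributions under both passes, so it gets zero weight and all of the response's advantage lands on the first token. A uniform weight of 1.0 for every token would give the same sequence-level advantage to each token instead.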
arXiv cs.AI · 7d ago · research
Self-Conditioned Credit Assignment for RL with Verifiable Rewards
Importance 3/3
Beats a top-lab benchmark
New feature vs. leaders
OpenAI
Google DeepMind
Meta AI
Evaluation & benchmarks
Reasoning models
Training methods