arXiv cs.AI · 7d ago · research

Self-Conditioned Credit Assignment for RL with Verifiable Rewards

SC-GRPO weights reinforcement-learning gradients by the per-token KL divergence from self-conditioned trajectories. It outperforms GRPO by 8.1% and DAPO by 5.9% across math, code, and agentic tasks, shows superior out-of-distribution performance, and also beats OPD.
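
The summary does not give the weighting formula, so the following is only an illustrative sketch of how a per-token KL signal could scale a GRPO-style advantage. Everything specific here is an assumption, not the paper's method: the KL direction (conditioned against policy), the mean-1 normalization of the weights, and all function names (`token_kl`, `kl_token_weights`, `weighted_token_advantages`) are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one token's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_kl(p, q, eps=1e-12):
    """KL(p || q) between two distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def group_advantages(rewards):
    """GRPO-style advantage: each reward standardized within its sampled group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + 1e-8) for r in rewards]

def kl_token_weights(policy_logits, conditioned_logits):
    """Per-token weights proportional to KL(conditioned || policy).

    ASSUMPTION: weights are rescaled to average 1 over the sequence, so the
    total gradient scale matches an unweighted update.
    """
    kls = [
        token_kl(softmax(c), softmax(p))
        for p, c in zip(policy_logits, conditioned_logits)
    ]
    n, total = len(kls), sum(kls)
    if total == 0:
        return [1.0] * n  # no signal: fall back to uniform credit
    return [k * n / total for k in kls]

def weighted_token_advantages(rewards, policy_logits, conditioned_logits):
    """Spread each sequence-level advantage over its tokens by KL weight."""
    advs = group_advantages(rewards)
    return [
        [a * w for w in kl_token_weights(p, c)]
        for a, p, c in zip(advs, policy_logits, conditioned_logits)
    ]
```

Under these assumptions, tokens where the self-conditioned distribution disagrees most with the policy receive the largest share of the sequence's advantage, while tokens with no divergence receive none.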

Benchmarks

Baseline	SC-GRPO improvement over baseline (math, code, and agentic tasks)
GRPO	8.1%
DAPO	5.9%
OPD	not stated