Rubric-Conditioned Self-Distillation is a framework that uses structured rubrics to provide fine-grained, token-level feedback during self-distillation of reasoning language models. Conditioning teacher models on rubric-level criteria enables more precise credit assignment than scalar rewards, and the method outperforms GRPO and OPSD by 1.0 and 0.9 points on average, respectively, across science reasoning benchmarks.
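To make "token-level feedback" concrete, here is a minimal sketch of one common form of token-level distillation signal: a per-token KL divergence between a teacher distribution and a student distribution. The summary does not state the paper's actual loss, so the KL form, the function names (`softmax`, `per_token_kl`), and the toy logits are all assumptions for illustration, not the authors' method.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def per_token_kl(teacher_logits, student_logits):
    """Return KL(teacher || student) for each token position.

    Assumption: the teacher logits come from the same model prompted with
    the rubric, the student logits from the model without it.
    """
    losses = []
    for t_row, s_row in zip(teacher_logits, student_logits):
        p = softmax(t_row)
        q = softmax(s_row)
        losses.append(sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0))
    return losses


# Hypothetical toy data: 3 token positions, vocabulary of 4 tokens.
student = [[2.0, 0.5, 0.1, 0.0], [0.2, 1.5, 0.3, 0.0], [1.0, 1.0, 1.0, 1.0]]
# The rubric-conditioned teacher disagrees with the student only at position 1.
teacher = [[2.0, 0.5, 0.1, 0.0], [0.2, 0.1, 2.5, 0.0], [1.0, 1.0, 1.0, 1.0]]

token_losses = per_token_kl(teacher, student)
sequence_loss = sum(token_losses) / len(token_losses)
print(token_losses, sequence_loss)
```

In this toy case only the middle position receives a nonzero loss, which is the sense in which a per-token signal localizes credit more finely than a single scalar reward for the whole response.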
arxiv
arXiv cs.AI · 7d ago · research
Rubric-Conditioned Self-Distillation Framework
Importance 3/3
Beats a top-lab benchmark
New feature vs. leaders
Allen AI
Microsoft Research
OpenAI
Evaluation & benchmarks
Reasoning models
Training methods