STARE addresses policy entropy collapse in GRPO-based reinforcement learning by identifying entropy-critical token subsets via surprisal quantiles and reweighting their advantages. It maintains stable policy entropy across model scales and tasks and outperforms DAPO and other baselines by 4%-8% on AIME24 and AIME25 while keeping a consistent exploration-exploitation balance.
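The general idea of surprisal-quantile selection followed by advantage reweighting can be sketched roughly as below. This is an illustrative sketch only, not the paper's actual rule: the function name `reweight_advantages`, the single quantile threshold `q`, the multiplicative `scale`, and the choice to rescale the high-surprisal tokens are all assumptions made for the example.

```python
import numpy as np

def reweight_advantages(token_logprobs, advantages, q=0.75, scale=0.5):
    """Hypothetical surprisal-quantile advantage reweighting.

    token_logprobs: log-probabilities the policy assigned to the sampled tokens.
    advantages:     per-token advantages (in GRPO, typically the sequence-level
                    group-normalized advantage broadcast to every token).
    q, scale:       assumed hyperparameters; the summary does not give values.
    """
    logp = np.asarray(token_logprobs, dtype=float)
    adv = np.asarray(advantages, dtype=float)

    # Surprisal of a sampled token is its negative log-probability.
    surprisal = -logp

    # Tokens at or above the q-th surprisal quantile form the selected subset.
    threshold = np.quantile(surprisal, q)
    selected = surprisal >= threshold

    # Rescale advantages on the selected subset; leave the rest unchanged.
    weights = np.where(selected, scale, 1.0)
    return adv * weights, selected

# Example: one low-probability (high-surprisal) token among four.
new_adv, mask = reweight_advantages([-0.1, -0.2, -3.0, -0.05], [1.0, 1.0, 1.0, 1.0])
```

In this example only the third token exceeds the 0.75 surprisal quantile, so only its advantage is rescaled. Whether high-surprisal tokens should be damped or amplified, and whether that depends on the sign of the advantage, is not stated in the summary above.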
arxiv
arXiv cs.AI
7d ago
research
STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability
Importance 3/3
New feature vs. leaders
New harness with differentiators
Allen AI
Evaluation & benchmarks
Reasoning models
Training methods