STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability
STARE addresses policy entropy collapse in GRPO-based reinforcement learning by identifying entropy-critical token subsets via surprisal quantiles and reweighting their advantages. It maintains stable policy entropy across model scales and tasks, outperforming DAPO and other baselines by 4%-8% on AIME24 and AIME25, with consistent exploration-exploitation balance.