arXiv cs.AI · 7d ago · research

STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability

STARE addresses policy entropy collapse in GRPO-based reinforcement learning: it identifies entropy-critical token subsets via surprisal quantiles and reweights the advantages of those tokens. It keeps policy entropy stable across model scales and tasks, maintains a consistent exploration-exploitation balance, and outperforms DAPO and other baselines by 4%-8% on AIME24 and AIME25.
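
The mechanism described above (pick tokens by surprisal quantile, then reweight their advantages) can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function name `reweight_advantages`, the single upper-quantile cutoff `q`, and the scalar `weight` are all assumptions made for illustration, and the summary does not say which quantiles STARE uses or in which direction it scales the advantages.

```python
import numpy as np

def reweight_advantages(logprobs, advantages, q=0.8, weight=0.5):
    """Scale the advantages of high-surprisal tokens (illustrative only).

    logprobs:   log-probability of each sampled token under the policy, shape (T,)
    advantages: per-token advantages, shape (T,); in GRPO this is typically
                the sequence-level advantage broadcast to every token
    q, weight:  hypothetical hyperparameters, not taken from the paper
    """
    logprobs = np.asarray(logprobs, dtype=float)
    advantages = np.asarray(advantages, dtype=float)

    surprisal = -logprobs                   # surprisal = -log p(token)
    threshold = np.quantile(surprisal, q)   # q-th surprisal quantile
    critical = surprisal > threshold        # assumed "entropy-critical" subset

    # Tokens in the subset get their advantage multiplied by `weight`;
    # all other tokens keep their original advantage.
    scale = np.where(critical, weight, 1.0)
    return advantages * scale, critical

# Example: five tokens, the last one sampled with low probability.
probs = np.array([0.9, 0.8, 0.7, 0.6, 0.05])
new_adv, mask = reweight_advantages(np.log(probs), np.ones(5))
```

In this toy run only the last token exceeds the 0.8 surprisal quantile, so only its advantage is rescaled. A real training loop would compute the quantile over a batch of rollouts and feed the reweighted advantages into the policy-gradient loss.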

Importance 3/3 · New feature vs. leaders · New harness with differentiators · arXiv cs.AI · Allen AI · Evaluation & benchmarks · Reasoning models · Training methods

Benchmarks

Benchmark	Model	Gain over baselines
AIME 2025	STARE	8%
AIME 2024	STARE	4%
