Benchmark · math

AIME 2025

8 results · 6 models

STARE · greedy router · baseline · stochastic router · adaptive prompt selection mechanism · IW-OPD

Timeline

2026-06-24 · IW-OPD · 6.9 pts · Importance-Weighted On-Policy Distillation Addresses Position Bias
2026-06-19 · adaptive prompt selection mechanism · 28.1% · Adaptive LLM Tutoring Improves Engagement and Efficiency
2026-06-19 · greedy router · 19.1% · Adaptive LLM Tutoring Improves Engagement and Efficiency
2026-06-19 · baseline · 19.6% · Adaptive LLM Tutoring Improves Engagement and Efficiency
2026-06-19 · stochastic router · 28.1% · Adaptive LLM Tutoring Improves Engagement and Efficiency
2026-06-18 · STARE · 8.0% · STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability
2026-06-18 · STARE · 8.0% · STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability
2026-06-18 · STARE · 8.0% · STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability