The authors argue that mainstream Reinforcement Learning with Verifiable Rewards (RLVR) often fails to expand the reasoning capacity of large language models and merely reallocates probability mass among trajectories the base model can already produce. To address this limitation, they introduce a boundary-aware curriculum RL approach designed to move beyond the base model's empirical reasoning capacity boundary. The method first uses pass@k sampling to locate the model's current reasoning limits, then applies targeted teacher guidance to examples near or beyond that boundary. Reinforcement learning is subsequently used to consolidate the newly introduced reasoning patterns. The approach is evaluated on Qwen, Llama, and DeepSeek base models. Experimental results show improvements in both pass@1 and pass@256, the latter serving as a proxy for the reasoning capacity boundary. Specifically, average pass@256 improved by 9.8 percentage points over the base models and by 10.3 percentage points over vanilla RLVR. These findings suggest that this curriculum-based strategy offers a scalable route for continuously improving LLM reasoning capabilities.
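The boundary-identification step can be illustrated with a minimal sketch. It assumes the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n sampled solutions of which c are correct; the summary does not say which estimator the authors use. The helper `split_by_boundary`, its `threshold` parameter, and the problem IDs are hypothetical names introduced here for illustration only.

```python
from __future__ import annotations

from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def split_by_boundary(
    correct_counts: dict[str, int], n: int, k: int, threshold: float = 0.0
) -> tuple[list[str], list[str]]:
    """Hypothetical split: problems whose pass@k exceeds `threshold` are
    treated as inside the empirical boundary, the rest as beyond it."""
    inside: list[str] = []
    beyond: list[str] = []
    for problem_id, c in correct_counts.items():
        target = inside if pass_at_k(n, c, k) > threshold else beyond
        target.append(problem_id)
    return inside, beyond


if __name__ == "__main__":
    # Assumed toy data: number of correct samples out of n = 8 per problem.
    counts = {"p1": 0, "p2": 3, "p3": 8}
    print(split_by_boundary(counts, n=8, k=4))
```

Under these assumptions, problems that land in the `beyond` list would be the candidates for teacher guidance, while the rest are left to standard RLVR. How the authors actually define "near" the boundary is not stated in the summary.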
Boundary-Aware Curriculum RL Expands LLM Reasoning Capacity Beyond Base Model Limits