Recent agentic reinforcement learning methods for large language models often suffer from instability or limited gains in tool-use tasks. Experiments reveal that some models experience catastrophic collapse, where performance drops abruptly and tool-invocation structures fail. Analysis shows these failures stem from unexpected probability spikes in specific control tokens that disrupt structured execution. Despite this disruption, the underlying tool-use capability remains intact but is obscured by specific formatting issues. To address this, the study investigates diverse supervisory signals including off-policy supervision and hint-based guidance under various training schemes. The authors find that interleaving supervised fine-tuning with reinforcement learning substantially improves stability during training. However, this approach exhibits degraded performance when evaluated on format and content out-of-distribution data. The results highlight the importance of understanding RL failures to enable robust training for complex multi-step tool-use tasks.
Multi-Step Tool-Use RL Collapse and Supervisory Fixes
from English