RODS: Reward-Driven Online Data Synthesis for Multi-Turn Tool-Use Agents
RODS addresses sample depletion in multi-turn tool-use RL by using reward variance to detect the policy's capability boundaries. It synthesizes new data in real time, matching the structural complexity of the boundary samples, and maintains a dynamic replay buffer that co-evolves with the policy. RODS achieves performance comparable to a 17K-sample offline pipeline while using 20x fewer trajectories.
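To make the variance-based boundary detection and the co-evolving buffer concrete, here is a minimal sketch. It is not the authors' implementation. It assumes binary per-rollout rewards, a hypothetical threshold `min_var`, and a hypothetical eviction rule that keeps the highest-variance tasks. The names `split_by_reward_variance` and `BoundaryBuffer` are illustrative only, and the data synthesis step itself is omitted.

```python
from statistics import pvariance


def split_by_reward_variance(group_rewards, min_var=0.05):
    """Partition task ids into (boundary, depleted) by per-task reward variance.

    group_rewards maps a task id to the rewards of several rollouts on it.
    min_var is an assumed threshold, not a value taken from the paper.
    """
    boundary, depleted = [], []
    for task_id, rewards in group_rewards.items():
        # Near-zero variance means every rollout scored the same: the task is
        # either already solved or currently out of reach for the policy.
        if pvariance(rewards) >= min_var:
            boundary.append(task_id)
        else:
            depleted.append(task_id)
    return boundary, depleted


class BoundaryBuffer:
    """Replay buffer sketch that tracks tasks near the capability boundary."""

    def __init__(self, capacity, min_var=0.05):
        self.capacity = capacity
        self.min_var = min_var
        self.variance = {}  # task_id -> most recent reward variance

    def update(self, group_rewards):
        """Refresh the buffer with the latest rollout rewards."""
        for task_id, rewards in group_rewards.items():
            var = pvariance(rewards)
            if var >= self.min_var:
                self.variance[task_id] = var      # insert or refresh
            else:
                self.variance.pop(task_id, None)  # evict depleted task
        # Assumed eviction rule: when over capacity, keep highest variance.
        if len(self.variance) > self.capacity:
            keep = sorted(self.variance, key=self.variance.get, reverse=True)
            self.variance = {t: self.variance[t] for t in keep[: self.capacity]}

    def task_ids(self):
        return sorted(self.variance)
```

In this sketch a task leaves the buffer once the policy solves it consistently, which is one simple way a buffer could track a moving capability boundary. The tasks remaining in the buffer would then serve as templates for whatever synthesis procedure generates new samples of similar structural complexity.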