Group-Graph Policy Optimization for Long-Horizon Agentic RL

Group-Graph Policy Optimization (G2PO) is a graph-based approach to long-horizon agentic reinforcement learning that transforms interaction trajectories into state-transition graphs. The graph structure enables group-aggregated state-value estimation and edge-centric advantage calculation, which improve credit assignment and reduce variance. G2PO improves success rate by up to 22.2% over GRPO on the WebShop, ALFWorld, and AppWorld benchmarks.
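
The summary above does not give G2PO's formulas, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that states are hashable, that a state's value is the mean Monte Carlo return over every visit to it across the group, and that an edge's advantage is the one-step quantity r + gamma * V(s') - V(s). The function names, the tuple layout, and the toy trajectories are all hypothetical.

```python
from collections import defaultdict


def group_state_values(trajectories, gamma=1.0):
    """Average the return-to-go of every visit to a state across the group.

    Each trajectory is a list of (state, action, reward, next_state) tuples.
    ASSUMPTION: a plain Monte Carlo mean; the real estimator may differ.
    """
    returns = defaultdict(list)
    for trajectory in trajectories:
        g = 0.0
        # Walk backwards so g accumulates the discounted return-to-go.
        for state, _action, reward, _next_state in reversed(trajectory):
            g = reward + gamma * g
            returns[state].append(g)
    return {state: sum(gs) / len(gs) for state, gs in returns.items()}


def edge_advantages(trajectories, gamma=1.0):
    """Score each graph edge (state, action, next_state) once for the group.

    ASSUMPTION: advantage = mean edge reward + gamma * V(next) - V(state),
    with V = 0 for states that are never left (terminal states).
    """
    values = group_state_values(trajectories, gamma)
    edge_rewards = defaultdict(list)
    for trajectory in trajectories:
        for state, action, reward, next_state in trajectory:
            # Identical transitions from different trajectories share an edge.
            edge_rewards[(state, action, next_state)].append(reward)
    advantages = {}
    for (state, action, next_state), rewards in edge_rewards.items():
        mean_reward = sum(rewards) / len(rewards)
        next_value = values.get(next_state, 0.0)
        advantages[(state, action, next_state)] = (
            mean_reward + gamma * next_value - values[state]
        )
    return advantages


# Toy group: two trajectories that share the start state "s0".
demo_group = [
    [("s0", "left", 0.0, "s1"), ("s1", "go", 1.0, "win")],
    [("s0", "right", 0.0, "s2"), ("s2", "go", 0.0, "lose")],
]
demo_advantages = edge_advantages(demo_group)
```

In this toy group the shared state "s0" gets the value 0.5, so the edge that leads toward the success is scored +0.5 and the edge that leads toward the failure is scored -0.5, while the later edges, whose outcome is already determined, are scored 0. This is the kind of step-level credit that a single trajectory-level score cannot express.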