PlanBench-XL: Benchmark for Long-Horizon Tool-Use Planning

PlanBench-XL evaluates long-horizon planning in LLM agents on 327 retail tasks that draw on a pool of 1,665 tools. It introduces a blocking mechanism that simulates real-world tool failures. Under severe disruptions, the accuracy of agents such as GPT-5.4 drops from 51.90% to 11.36%, exposing weaknesses in recovery and error handling.
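
To make the blocking idea concrete, the following is a minimal sketch of how a harness might disable tools and how an agent might recover by falling back to an alternative. It is not PlanBench-XL's implementation: the names `BlockingToolRegistry`, `ToolBlockedError`, `call_with_fallback`, and the two lookup tools are hypothetical, and the fixed-order fallback is an assumed recovery strategy chosen only for illustration. The last helper works through the arithmetic of the reported drop, which is 40.54 percentage points, or roughly 78% in relative terms.

```python
from typing import Callable, Dict, Iterable, Sequence, Set, Tuple


class ToolBlockedError(RuntimeError):
    """Raised when a caller invokes a tool that the harness has disabled."""


class BlockingToolRegistry:
    """Hypothetical tool registry that can mark tools as unavailable."""

    def __init__(self, tools: Dict[str, Callable[..., object]]) -> None:
        self._tools = dict(tools)
        self._blocked: Set[str] = set()

    def block(self, names: Iterable[str]) -> None:
        # Simulate a tool failure by disabling the named tools.
        self._blocked.update(names)

    def call(self, name: str, *args: object, **kwargs: object) -> object:
        if name in self._blocked:
            raise ToolBlockedError(f"tool '{name}' is unavailable")
        return self._tools[name](*args, **kwargs)


def call_with_fallback(
    registry: BlockingToolRegistry,
    candidates: Sequence[str],
    *args: object,
    **kwargs: object,
) -> Tuple[str, object]:
    """Try each candidate tool in order and return the first that succeeds."""
    for name in candidates:
        try:
            return name, registry.call(name, *args, **kwargs)
        except ToolBlockedError:
            continue  # Assumed recovery strategy: move on to the next tool.
    raise ToolBlockedError("all candidate tools are blocked")


def relative_drop(before: float, after: float) -> float:
    """Fraction of the original accuracy that is lost."""
    return (before - after) / before


# Hypothetical retail tools, invented for this example only.
registry = BlockingToolRegistry(
    {
        "lookup_order_primary": lambda order_id: f"order-{order_id} (primary)",
        "lookup_order_backup": lambda order_id: f"order-{order_id} (backup)",
    }
)
registry.block(["lookup_order_primary"])

used, result = call_with_fallback(
    registry, ["lookup_order_primary", "lookup_order_backup"], 42
)
print(used, result)  # lookup_order_backup order-42 (backup)
print(round(relative_drop(51.90, 11.36), 3))  # 0.781
```

An agent that cannot perform this kind of substitution fails the task as soon as a required tool is blocked, which is the failure mode the benchmark is designed to expose.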