SAFARI: Scaling Long Horizon Agentic Fault Attribution via Active Investigation

The article introduces SAFARI, a framework that diagnoses failures in autonomous agents by replacing linear context loading with a tool-augmented diagnostic loop. Because the evaluator inspects trajectory segments through specialized tools and records its findings in short-term memory, diagnostic accuracy no longer depends on the model's architectural context limit.

SAFARI utilizes a toolbox for reading and searching trajectory segments alongside persistent Short-Term Memory for cross-turn reasoning.
It improves on the previous state of the art by 20% on the Who&When dataset within a 1M-token budget.
The framework achieves a 19% improvement on the TRAIL GAIA subset using a 25K-token budget.
SAFARI maintains 0.58 precision even when the target fault is located five times beyond the model's native context window.

This method makes failure diagnosis practical for long-horizon tasks, where traditional evaluators fail because the trajectory does not fit in the context window.
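
The diagnostic loop described above can be illustrated with a minimal Python sketch. It is not SAFARI's implementation: the names `Toolbox`, `ShortTermMemory`, and `investigate` are hypothetical, the trajectory is assumed to be a plain list of step strings, and a fixed keyword search stands in for the model's own choice of tool calls. The point is only that the investigator touches small segments and carries notes between turns, rather than loading the whole trajectory.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Toolbox:
    """Hypothetical tools over a trajectory stored as a list of step strings."""
    steps: list[str]

    def read(self, start: int, end: int) -> list[tuple[int, str]]:
        # Return only the requested window as (index, step) pairs.
        return list(enumerate(self.steps[start:end], start=start))

    def search(self, keyword: str) -> list[int]:
        # Return indices of steps that mention the keyword (case-insensitive).
        needle = keyword.lower()
        return [i for i, step in enumerate(self.steps) if needle in step.lower()]


@dataclass
class ShortTermMemory:
    """Notes that persist across turns of the investigation."""
    notes: list[str] = field(default_factory=list)

    def add(self, note: str) -> None:
        self.notes.append(note)


def investigate(toolbox: Toolbox, keyword: str, window: int = 2):
    """Stand-in policy (an assumption): blame the earliest step matching keyword.

    A real system would let a language model pick each tool call; here the
    sequence is scripted so the sketch runs without any model.
    """
    memory = ShortTermMemory()
    hits = toolbox.search(keyword)
    memory.add(f"search {keyword!r}: hits at {hits}")
    if not hits:
        return None, memory
    first = hits[0]
    # Read a small window ending at the first hit instead of the full trajectory.
    segment = toolbox.read(max(0, first - window), first + 1)
    memory.add(f"read steps {segment[0][0]}-{segment[-1][0]}")
    return first, memory


# Toy trajectory, invented for illustration only.
steps = [
    "plan: look up the population of Paris",
    "tool call: web_search('Paris population')",
    "observation: 2.1 million",
    "tool call: calculator('2.1 * 1000')",
    "observation: error: invalid unit",
    "answer: 2100",
]
fault_step, memory = investigate(Toolbox(steps), "error")
```

In this toy run the investigator reads three of the six steps, and the memory notes record which searches and reads led to the attributed step.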
