The article introduces SAFARI, a framework for diagnosing failures in autonomous agents that replaces linear context loading with a tool-augmented diagnostic loop. Because the evaluator analyzes trajectory segments through specialized tools and carries its findings in short-term memory, diagnostic accuracy is decoupled from the model's architectural context limit.
- SAFARI utilizes a toolbox for reading and searching trajectory segments alongside persistent Short-Term Memory for cross-turn reasoning.
- It outperforms the prior state of the art by 20% on the Who&When dataset within a 1M-token budget.
- The framework achieves a 19% improvement on the TRAIL GAIA subset using a 25K-token budget.
- SAFARI maintains 0.58 precision even when the target fault is located five times beyond the model's native context window.
This method enables effective failure diagnosis in long-horizon tasks, where traditional evaluators fail because of context window constraints.
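
To make the idea of a tool-augmented diagnostic loop concrete, here is a minimal Python sketch. It is not SAFARI's implementation: the names `Trajectory`, `ShortTermMemory`, and `diagnose`, the step format, and the keyword rule that stands in for the model's judgment are all assumptions made for illustration. What it shows is the general pattern described above: the evaluator searches the trajectory, reads only a small window around a candidate step, and records its findings in memory that persists across turns.

```python
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    """Agent trajectory as a list of step strings (hypothetical format)."""
    steps: list[str]

    def read(self, start: int, end: int) -> list[tuple[int, str]]:
        """Tool: return the steps in [start, end) paired with their indices."""
        return list(enumerate(self.steps[start:end], start=start))

    def search(self, keyword: str) -> list[int]:
        """Tool: return indices of steps containing keyword (case-insensitive)."""
        kw = keyword.lower()
        return [i for i, step in enumerate(self.steps) if kw in step.lower()]


@dataclass
class ShortTermMemory:
    """Notes that persist across turns of the diagnostic loop."""
    notes: list[str] = field(default_factory=list)

    def add(self, note: str) -> None:
        self.notes.append(note)


def diagnose(traj: Trajectory, keywords: list[str],
             window: int = 1, max_turns: int = 8):
    """Return (index of suspected fault step or None, memory).

    Assumption: a fixed keyword list replaces the model's choice of tool
    calls, and the earliest matching step is taken as the fault.
    """
    memory = ShortTermMemory()
    candidates: list[int] = []
    # One search per turn; each result is written to memory, not to context.
    for turn, kw in enumerate(keywords[:max_turns], start=1):
        hits = traj.search(kw)
        memory.add(f"turn {turn}: search({kw!r}) -> {hits}")
        candidates.extend(hits)
    if not candidates:
        return None, memory
    first = min(candidates)
    # Read only a bounded window around the candidate, never the whole trace.
    segment = traj.read(max(0, first - window), first + window + 1)
    memory.add(f"read steps {segment[0][0]}..{segment[-1][0]} around step {first}")
    return first, memory


traj = Trajectory([
    "plan: fetch page",
    "tool call: http_get",
    "observation: Timeout error",
    "retry with same url",
    "final answer: unknown",
])
fault_step, memory = diagnose(traj, ["error", "retry"])
print(fault_step)
print("\n".join(memory.notes))
```

In this sketch the amount of text inspected per turn depends on `window` rather than on the length of the trajectory, which is the property that lets such a loop operate on traces longer than the model's context window.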