Researchers introduce PACE, a framework that constructs proxy benchmarks for predicting an LLM's performance on expensive agentic evaluations from its scores on a small, curated set of non-agentic instances that test atomic capabilities. PACE fits a regression model that maps scores on these curated instances to scores on the target agentic benchmarks; the resulting PACE-Bench achieves high predictive accuracy at a fraction of the cost.
- Experiments across 14 models and 4 agentic benchmarks show PACE-Bench predicts agentic scores with a leave-one-out cross-validation mean absolute error under 4% and Spearman correlation above 0.80.
- The framework achieves around 85% pairwise model-ranking accuracy at less than 1% of the cost of full agentic evaluation.
- Analysis of selected proxy instances reveals which specific skills each agentic benchmark uniquely demands.
PACE enables practitioners to obtain reliable estimates of agentic performance during model development, selection, and routing without the overhead of running full agent evaluations.
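
To make the evaluation protocol concrete, the following is a minimal sketch of how a regression from proxy scores to an agentic score could be assessed with leave-one-out cross-validation, mean absolute error, Spearman correlation, and pairwise ranking accuracy. It is an illustration under stated assumptions, not the authors' implementation: the summary does not specify the regression type, so ridge regression is swapped in here, and the data, the feature count, and all function names (`fit_ridge`, `loo_predictions`, `spearman`, `pairwise_ranking_accuracy`) are hypothetical. The scores are synthetic and say nothing about the reported results.

```python
import numpy as np


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept (assumed model)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return w, y_mean - x_mean @ w


def loo_predictions(X, y, alpha=1.0):
    """Predict each model's agentic score from a fit on all other models."""
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        w, b = fit_ridge(X[mask], y[mask], alpha)
        preds[i] = X[i] @ w + b
    return preds


def spearman(a, b):
    """Spearman correlation via ranks; this simple version assumes no ties."""
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return float(np.corrcoef(ra, rb)[0, 1])


def pairwise_ranking_accuracy(y_true, y_pred):
    """Fraction of model pairs whose predicted order matches the true order."""
    correct = total = 0
    n = len(y_true)
    for i in range(n):
        for j in range(i + 1, n):
            if y_true[i] == y_true[j]:
                continue  # skip pairs with no true ordering
            total += 1
            correct += int((y_true[i] - y_true[j]) * (y_pred[i] - y_pred[j]) > 0)
    return correct / total


# Synthetic stand-in data: 14 models, 5 hypothetical proxy score features.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(14, 5))
w_true = np.array([0.3, 0.1, 0.25, 0.05, 0.2])
y = X @ w_true + rng.normal(0.0, 0.02, size=14)

preds = loo_predictions(X, y, alpha=0.1)
mae = float(np.mean(np.abs(preds - y)))
rho = spearman(y, preds)
acc = pairwise_ranking_accuracy(y, preds)
print(f"LOO MAE={mae:.3f}  Spearman={rho:.2f}  pairwise acc={acc:.2f}")
```

Leave-one-out is a natural fit when only a handful of models are available, since every model serves once as the held-out test case. With so few data points, some form of regularization (here the assumed `alpha` penalty) helps keep the fit from memorizing the training models.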