Preference-Based Trajectory Evaluation for Agentic Systems
Under standard success-based metrics, offline evaluation of agentic systems yields tied comparisons in 75% of cases. Preference-based trajectory evaluation reduces the tie rate to 35% by comparing progress and time-to-return profiles, which improves both discriminative power and data efficiency. These results suggest that benchmark saturation may stem from the choice of evaluation method, not only from data or problem difficulty.
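
To make the contrast concrete, the following is a minimal sketch of how a success-based comparison and a preference-based comparison can disagree on whether two trajectories are tied. It is not the evaluation procedure behind the figures above. The `Trajectory` fields, the scalar summaries of progress and time-to-return, the lexicographic ordering (success, then progress, then time-to-return), and the `progress_tol` threshold are all assumptions introduced for illustration, and the sample pairs are toy data.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Trajectory:
    """Hypothetical scalar summary of one agent run (illustrative only)."""
    success: bool          # did the run reach the goal?
    progress: float        # assumed: fraction of the task completed, in [0, 1]
    time_to_return: float  # assumed: steps taken before the agent returned


Comparator = Callable[[Trajectory, Trajectory], int]


def success_compare(a: Trajectory, b: Trajectory) -> int:
    """Return 1 if a wins, -1 if b wins, 0 on a tie, using success only."""
    return (a.success > b.success) - (a.success < b.success)


def preference_compare(a: Trajectory, b: Trajectory,
                       progress_tol: float = 0.05) -> int:
    """Refine the success comparison with progress, then time-to-return.

    Assumed ordering: success first, then higher progress (beyond a
    tolerance), then lower time-to-return. Returns 1, -1, or 0.
    """
    outcome = success_compare(a, b)
    if outcome != 0:
        return outcome
    diff = a.progress - b.progress
    if abs(diff) > progress_tol:
        return 1 if diff > 0 else -1
    if a.time_to_return != b.time_to_return:
        return 1 if a.time_to_return < b.time_to_return else -1
    return 0


def tie_rate(pairs: List[Tuple[Trajectory, Trajectory]],
             compare: Comparator) -> float:
    """Fraction of pairs that the comparator cannot separate."""
    if not pairs:
        return 0.0
    ties = sum(1 for a, b in pairs if compare(a, b) == 0)
    return ties / len(pairs)


# Toy pairs, invented for this sketch; they do not reproduce reported results.
PAIRS = [
    (Trajectory(True, 1.0, 12), Trajectory(True, 1.0, 30)),    # both succeed
    (Trajectory(False, 0.6, 40), Trajectory(False, 0.2, 40)),  # both fail
    (Trajectory(True, 1.0, 20), Trajectory(False, 0.7, 50)),   # one succeeds
    (Trajectory(False, 0.3, 25), Trajectory(False, 0.32, 25)), # near-identical
    (Trajectory(False, 0.1, 60), Trajectory(False, 0.5, 45)),  # both fail
]

if __name__ == "__main__":
    print("success-based tie rate:   ", tie_rate(PAIRS, success_compare))
    print("preference-based tie rate:", tie_rate(PAIRS, preference_compare))
```

In this sketch the preference-based comparator never contradicts the success-based one; it only separates pairs that success alone leaves tied. That is one possible design choice, and a real evaluator might instead compare full progress curves or learn the preference from annotated pairs.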