arXiv cs.LG · 8d ago · src: 9d ago · research

Preference-Based Trajectory Evaluation for Agentic Systems


Offline evaluation of agentic systems with standard success-based metrics leaves 75% of comparisons tied. Preference-based trajectory evaluation cuts the tie rate to 35% by comparing progress and time-to-return profiles, which improves discriminative power and data efficiency. The results suggest that benchmark saturation may stem from the choice of evaluation method, not only from data or problem difficulty.
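To make the tie-rate argument concrete, here is a minimal Python sketch. It is not the paper's method. The `Trajectory` fields, the lexicographic rule in `compare_preference`, and the toy data are all assumptions made for illustration, with `steps` standing in as a rough proxy for a time profile. The toy numbers are not the reported 75% and 35%. The sketch only shows why a binary success metric ties more often than a comparison that also looks at partial progress.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Trajectory:
    # All fields are hypothetical; the summary does not define a schema.
    success: bool    # did the agent finish the task?
    progress: float  # fraction of the task completed, in [0, 1]
    steps: int       # steps taken, a stand-in for a time profile


def compare_success(a: Trajectory, b: Trajectory) -> int:
    """Return 1 if a wins, -1 if b wins, 0 on a tie, using success only."""
    return (a.success > b.success) - (a.success < b.success)


def compare_preference(a: Trajectory, b: Trajectory, tol: float = 1e-9) -> int:
    """Assumed rule: prefer more progress, then fewer steps; else tie."""
    if abs(a.progress - b.progress) > tol:
        return 1 if a.progress > b.progress else -1
    if a.steps != b.steps:
        return 1 if a.steps < b.steps else -1
    return 0


def tie_rate(trajectories, compare) -> float:
    """Fraction of unordered trajectory pairs that the comparator ties."""
    pairs = list(combinations(trajectories, 2))
    ties = sum(1 for a, b in pairs if compare(a, b) == 0)
    return ties / len(pairs)


# Toy data, invented for illustration only.
toy = [
    Trajectory(success=False, progress=0.2, steps=10),
    Trajectory(success=False, progress=0.6, steps=10),
    Trajectory(success=False, progress=0.6, steps=10),
    Trajectory(success=True, progress=1.0, steps=8),
]

print("success-based tie rate:", tie_rate(toy, compare_success))
print("preference-based tie rate:", tie_rate(toy, compare_preference))
```

In this toy set the three failed trajectories are indistinguishable under the success metric, so half of the six pairs tie. The progress-aware comparator separates all but the two identical failures.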


Benchmarks

Evaluation method	Tied comparisons
Standard success-based metrics	75%
Preference-based trajectory evaluation	35%
