Offline evaluation of agentic systems with standard success-based metrics leaves 75% of comparisons tied. Preference-based trajectory evaluation cuts the tie rate to 35% by comparing progress and time-to-return profiles, which improves discriminative power and data efficiency. These results suggest that apparent benchmark saturation may stem from the choice of evaluation method, not only from data or problem difficulty.
arXiv cs.LG
Preference-Based Trajectory Evaluation for Agentic Systems
Benchmarks
| Evaluation method | Tied comparisons |
|---|---|
| Standard success-based metrics | 75% |
| Preference-based trajectory evaluation | 35% |
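The summary above attributes the lower tie rate to comparing progress and time-to-return profiles instead of success alone. The following is a minimal sketch of that idea, not the paper's procedure: the `Trajectory` fields, the lexicographic ordering (success, then progress, then fewer steps), and the toy data are all assumptions made for illustration, and the resulting tie rates do not reproduce the reported figures.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Trajectory:
    """Hypothetical per-task record; field names are illustrative only."""
    success: bool    # did the agent solve the task?
    progress: float  # assumed: fraction of the task completed, in [0, 1]
    steps: int       # assumed: steps taken before the episode ended


def compare_success(a: Trajectory, b: Trajectory) -> int:
    """Return 1 if a wins, -1 if b wins, 0 on a tie, using success only."""
    return (a.success > b.success) - (a.success < b.success)


def compare_preference(a: Trajectory, b: Trajectory, eps: float = 1e-9) -> int:
    """Assumed lexicographic preference: success, then progress, then fewer steps."""
    by_success = compare_success(a, b)
    if by_success != 0:
        return by_success
    if abs(a.progress - b.progress) > eps:
        return 1 if a.progress > b.progress else -1
    # Equal success and progress: prefer the trajectory that took fewer steps.
    return (a.steps < b.steps) - (a.steps > b.steps)


Pair = Tuple[Trajectory, Trajectory]


def tie_rate(pairs: List[Pair], compare: Callable[[Trajectory, Trajectory], int]) -> float:
    """Fraction of pairs that the given comparison cannot separate."""
    ties = sum(1 for a, b in pairs if compare(a, b) == 0)
    return ties / len(pairs)


# Toy paired trajectories for two agents on five tasks (made-up values).
PAIRS: List[Pair] = [
    (Trajectory(True, 1.0, 12), Trajectory(True, 1.0, 20)),    # both succeed, first is faster
    (Trajectory(False, 0.6, 30), Trajectory(False, 0.2, 30)),  # both fail, first gets further
    (Trajectory(False, 0.0, 30), Trajectory(False, 0.0, 30)),  # indistinguishable
    (Trajectory(True, 1.0, 15), Trajectory(False, 0.4, 30)),   # first wins outright
    (Trajectory(False, 0.3, 30), Trajectory(True, 1.0, 18)),   # second wins outright
]

if __name__ == "__main__":
    print("success-only tie rate:", tie_rate(PAIRS, compare_success))
    print("preference tie rate:  ", tie_rate(PAIRS, compare_preference))
```

On this toy data the success-only comparison cannot separate three of the five pairs, while the preference comparison leaves only the indistinguishable pair tied. A lexicographic order is only one way to combine the signals; a weighted or learned preference would fit the same interface.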