As enterprise agent tool catalogs scale from 10 to 110 agents, routing accuracy drops 16--23 percentage points on under-specified requests. An oracle analysis identifies retrieval and confusion gaps, with embedding-based shortlisting recovering +10--11pp F1. A human-annotated study of 1,435 utterances confirms real-world recovery of +10--17pp despite lower absolute performance.
arxiv
arXiv cs.CL
·
8d ago
·
research
Routing Accuracy Degradation and Recovery in Enterprise Agent Systems
from English
Importance 3/3
New harness with differentiators
arXiv cs.CL
OpenAI
Anthropic
Google DeepMind
AI agents
Evaluation & benchmarks
Research paper
Benchmarks
| Benchmark | Model | Score |
|---|---|---|
| SWE-bench Verified | three frontier models | — |