arXiv cs.CL · 2d ago · research

EnterpriseClawBench: Real-World Agent Benchmark Released

EnterpriseClawBench is a benchmark built from real workplace sessions, comprising 852 reproducible tasks with detailed metadata. The best configuration, Codex with GPT-5.5, scores only 0.663, highlighting the need for multi-dimensional evaluation of enterprise agents.
