AGORA: Benchmark for Agentic Workplace Document Reasoning

AGORA is a benchmark of 362 questions over 9,664 authentic workplace documents totaling 372M tokens, more than any model's context window can hold. It evaluates whether agents can explore documents deliberately, reconcile inconsistencies, and reason across domains. Even the top models evaluated reach only 59.4% accuracy.
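
To make the scale concrete, the short sketch below works through the arithmetic behind the corpus figures. The document and token counts come from the summary above, with "372M" taken as a round 372,000,000. The 1M-token context window is a hypothetical value chosen only for illustration, and the function name `corpus_scale` is likewise an assumption, not part of the benchmark.

```python
# Figures quoted in the summary; "372M" is treated as a round number.
NUM_DOCUMENTS = 9_664
TOTAL_TOKENS = 372_000_000

# Assumption: a hypothetical 1M-token context window, used only for illustration.
ASSUMED_CONTEXT_WINDOW = 1_000_000


def corpus_scale(total_tokens: int, num_documents: int, context_window: int):
    """Return (average tokens per document,
    number of context windows needed to hold the corpus,
    number of average-sized documents that fit in one window)."""
    avg_tokens_per_doc = total_tokens / num_documents
    windows_needed = total_tokens / context_window
    docs_per_window = context_window / avg_tokens_per_doc
    return avg_tokens_per_doc, windows_needed, docs_per_window


avg, windows, per_window = corpus_scale(
    TOTAL_TOKENS, NUM_DOCUMENTS, ASSUMED_CONTEXT_WINDOW
)
print(f"average document length: {avg:,.0f} tokens")
print(f"corpus size in assumed context windows: {windows:,.0f}")
print(f"average-sized documents per assumed window: {per_window:.1f}")
```

Under these assumptions the average document is roughly 38,500 tokens long, so even a 1M-token window would hold only about 26 average-sized documents out of 9,664. This is why an agent has to choose which documents to open rather than read the whole corpus at once.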