The article introduces Ko-WideSearch, a benchmark that evaluates the breadth-search capabilities of web agents in Korean and addresses the lack of exhaustive set-enumeration metrics outside English.

  • The benchmark uses an automated synthesize-and-verify pipeline to create tasks that each require a complete membership list and attribute table, covering 190 entities across 16 categories.
  • It spans 228 tables graded by Item-, Column-, and Row-F1, with difficulty controlled by table width and composite keys.
  • An evaluation of twenty web agents reveals a consistent failure pattern: agents recover the sets but not the individual rows, and accuracy drops as structural complexity increases.
  • Analysis shows that finding the correct value in open-ended free-text cells is the primary challenge, while standard answers such as dates or names are handled correctly.
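
To make the three scores and the "sets but not rows" pattern concrete, here is a minimal sketch of how Item-, Column-, and Row-F1 could be computed. The summary does not give the benchmark's exact definitions, so everything here is an assumption: cells are compared by exact string match, Row-F1 credits a row only when every cell matches, Item-F1 scores each non-key cell on its own, and Column-F1 is a macro average of per-column F1. The table contents, column names, and function names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Row = Dict[str, str]


def f1(pred: Set, gold: Set) -> float:
    """Set-based F1: harmonic mean of precision and recall."""
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def key_of(row: Row, keys: List[str]) -> Tuple[str, ...]:
    """Composite key: the tuple of values in the key columns."""
    return tuple(row[k] for k in keys)


def row_f1(pred: List[Row], gold: List[Row]) -> float:
    # Assumption: a row is correct only if all of its cells match exactly.
    def as_set(rows: List[Row]) -> Set:
        return {tuple(sorted(r.items())) for r in rows}
    return f1(as_set(pred), as_set(gold))


def item_f1(pred: List[Row], gold: List[Row], keys: List[str]) -> float:
    # Assumption: each non-key cell is one item, identified by (row key, column, value).
    def cells(rows: List[Row]) -> Set:
        return {(key_of(r, keys), c, v)
                for r in rows for c, v in r.items() if c not in keys}
    return f1(cells(pred), cells(gold))


def column_f1(pred: List[Row], gold: List[Row], keys: List[str]) -> float:
    # Assumption: macro average of per-column F1 over the non-key columns.
    columns = [c for c in gold[0] if c not in keys]
    scores = []
    for c in columns:
        p = {(key_of(r, keys), r.get(c)) for r in pred}
        g = {(key_of(r, keys), r[c]) for r in gold}
        scores.append(f1(p, g))
    return sum(scores) / len(scores)


# Hypothetical two-row table with a composite key (name, year).
keys = ["name", "year"]
gold = [
    {"name": "A", "year": "2020", "city": "Seoul", "note": "founded as a co-op"},
    {"name": "B", "year": "2021", "city": "Busan", "note": "merged with a rival"},
]
# The prediction finds both entities and the short "city" cells,
# but gets the free-text "note" cell wrong in every row.
pred = [
    {"name": "A", "year": "2020", "city": "Seoul", "note": "unknown"},
    {"name": "B", "year": "2021", "city": "Busan", "note": "unknown"},
]

membership = f1({key_of(r, keys) for r in pred}, {key_of(r, keys) for r in gold})
print(membership, item_f1(pred, gold, keys), column_f1(pred, gold, keys), row_f1(pred, gold))
```

In this toy case the set of entities is fully recovered while no complete row is, which is the shape of the failure described in the list above. A real grader would likely need normalization or fuzzy matching for free-text cells instead of exact string equality.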

This benchmark highlights the significant gap in current web agent performance regarding exhaustive data retrieval and provides a standardized method for assessing this specific capability.