The authors introduce DiscoBench, a benchmark that evaluates whether LLM-powered search agents can proactively identify ambiguity and ask effective clarification questions during deep search tasks. Unlike existing benchmarks, which assume that user queries are complete, DiscoBench targets the vague or underspecified requests that are common in real-world use.

  • The dataset contains 211 samples and 463 ambiguity instances across 11 real-world domains, covering four distinct types of ambiguity.
  • A user simulator is designed to facilitate multi-turn interaction for evaluating model performance.
  • Evaluation metrics include task utility, ambiguity detection, interaction strategy, and cost efficiency.
  • Experiments on representative LLMs reveal that ambiguity detection and effective clarification are distinct capabilities.
  • Results show that agents which keep searching rather than asking for clarification often perform worse than agents that simply guess.
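
The interaction setup described in the points above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from DiscoBench itself: the `Sample` fields, the keyword-matching `SimulatedUser`, the `run_episode` loop, the `toy_agent`, and the three simplified scores (answer correctness, share of ambiguities asked about, number of questions) are hypothetical stand-ins for the benchmark's task utility, ambiguity detection, and cost metrics.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """Hypothetical benchmark item: an underspecified query plus hidden facts."""
    query: str
    hidden_details: dict  # ambiguity label -> fact only the simulated user knows
    answer: str


class SimulatedUser:
    """Toy user simulator: reveals a hidden fact when the question names its label."""

    def __init__(self, sample):
        self.sample = sample

    def reply(self, question):
        q = question.lower()
        for label, fact in self.sample.hidden_details.items():
            if label in q:
                return label, fact
        return None, "I am not sure what you mean."


def run_episode(agent, sample, max_turns=5):
    """Run one multi-turn episode and return simplified, illustrative scores."""
    user = SimulatedUser(sample)
    context = [sample.query]
    detected = set()
    questions = 0
    final = None
    for _ in range(max_turns):
        action, text = agent(context)  # agent returns ("ask", ...) or ("answer", ...)
        if action == "answer":
            final = text
            break
        questions += 1
        label, reply = user.reply(text)
        if label is not None:
            detected.add(label)
        context.append(reply)
    total = len(sample.hidden_details)
    return {
        "correct": final == sample.answer,                           # task utility (toy)
        "detection_rate": len(detected) / total if total else 1.0,   # ambiguity detection (toy)
        "questions_asked": questions,                                # interaction cost (toy)
    }


def toy_agent(context):
    """Hypothetical agent: asks one clarification question, then answers."""
    if len(context) == 1:
        return "ask", "Which year do you mean?"
    return "answer", ("Paris" if "2024" in context[-1] else "unknown")


sample = Sample(
    query="Where were the Olympics held?",
    hidden_details={"year": "I mean the 2024 Summer Games."},
    answer="Paris",
)
result = run_episode(toy_agent, sample)
print(result)
```

Keeping the agent as a plain callable means a guess-only or search-only baseline can be swapped in and compared under the same loop; a real simulator would need to judge questions semantically rather than by keyword.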

This work highlights a critical gap between retrieval ability and interactive problem-solving in current search agents: strong search performance does not guarantee that an agent will recognize and resolve an underspecified query.