Researchers introduce SABER-Math, the first fully automated benchmark for evaluating mathematical information retrieval without expert annotation. It addresses a known difficulty: when retrieval is judged only through downstream performance, the retriever's own effect is hard to isolate.

  • The benchmark draws on 283K high-school-level math problems, building challenging reranking tasks from LLM-extracted summaries and ontology-based similarity.
  • A Swiss-style LLM preference tournament generates fine-grained relevance ratings for documents within these tasks.
  • Evaluation reveals that modern embedding models outperform classical baselines but struggle in symbol-heavy domains like Algebra and Calculus.
  • General-purpose benchmarks such as MTEB fail to reliably predict mathematical performance, highlighting the need for specialized evaluation tools.
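
To make the tournament step concrete, here is a minimal sketch of how a Swiss-style preference tournament can turn pairwise judgments into graded ratings. It is an illustration under stated assumptions, not the benchmark's actual procedure: the function name `swiss_tournament`, the `prefer` callback (a stand-in for an LLM preference call), the round count, and the use of win counts as ratings are all hypothetical.

```python
import random
from typing import Callable, Dict, List


def swiss_tournament(
    docs: List[str],
    prefer: Callable[[str, str], bool],
    rounds: int = 3,
    seed: int = 0,
) -> Dict[str, int]:
    """Run a Swiss-style tournament and return the win count per document.

    `prefer(a, b)` is a hypothetical judge: True means `a` is preferred
    over `b`. In practice this would wrap an LLM preference query.
    Documents are assumed to be unique strings.
    """
    rng = random.Random(seed)
    wins = {d: 0 for d in docs}
    played = set()  # unordered pairs that have already met

    for _ in range(rounds):
        # Rank by current score; ties are broken randomly.
        unpaired = sorted(docs, key=lambda d: (-wins[d], rng.random()))
        while len(unpaired) >= 2:
            a = unpaired.pop(0)
            # Pair with the closest-ranked opponent not yet faced;
            # fall back to the closest-ranked one if all were faced.
            idx = next(
                (i for i, b in enumerate(unpaired)
                 if frozenset((a, b)) not in played),
                0,
            )
            b = unpaired.pop(idx)
            played.add(frozenset((a, b)))
            winner = a if prefer(a, b) else b
            wins[winner] += 1
        # With an odd number of documents, the leftover one sits out.

    return wins
```

Because each round pairs documents with similar scores, the judge's comparisons concentrate on close calls, so far fewer than the n(n-1)/2 comparisons of a full round-robin are needed to separate documents into rating tiers.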

Taken together, the results make the case for math-specific retrieval benchmarks: general-purpose evaluations do not accurately reflect performance on complex mathematical tasks.