SABER-Math: Automated Benchmark for Information Retrieval Evaluation in Mathematics

Researchers introduce SABER-Math, the first fully automated benchmark for evaluating mathematical information retrieval without expert annotation, addressing the difficulty of isolating retriever effects on downstream performance.

The benchmark draws on 283K high-school-level math problems and builds challenging reranking tasks from LLM-extracted summaries and ontology-based similarity.
A Swiss-style LLM preference tournament generates fine-grained relevance ratings for documents within these tasks.
Evaluation reveals that modern embedding models outperform classical baselines but struggle in symbol-heavy domains like Algebra and Calculus.
General-purpose benchmarks such as MTEB do not reliably predict retrieval performance on mathematical content, which points to the need for specialized evaluation tools.
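
To make the Swiss-style preference tournament concrete, here is a minimal sketch of the general idea: in each round, documents with similar running scores are paired and a judge picks the preferred one. This is not the authors' implementation. The function name `swiss_tournament`, the `prefer` callback (a stand-in for an LLM judge), the bye handling, and the use of raw win counts as relevance scores are all assumptions made for illustration.

```python
from typing import Callable, Dict, FrozenSet, List, Set


def swiss_tournament(
    docs: List[str],
    prefer: Callable[[str, str], str],
    rounds: int,
) -> Dict[str, float]:
    """Run a simplified Swiss-style tournament over unique document ids.

    `prefer(a, b)` is a hypothetical judge that returns whichever of the two
    ids it considers more relevant (in the paper's setting, an LLM).
    Returns the number of wins per document.
    """
    scores: Dict[str, float] = {d: 0.0 for d in docs}
    played: Set[FrozenSet[str]] = set()

    for _ in range(rounds):
        # Rank by current score, highest first (stable for ties).
        unpaired = sorted(docs, key=lambda d: -scores[d])
        while len(unpaired) >= 2:
            a = unpaired.pop(0)
            # Prefer the closest-ranked opponent that `a` has not met yet;
            # fall back to the closest-ranked one if all have been met.
            idx = next(
                (i for i, b in enumerate(unpaired)
                 if frozenset((a, b)) not in played),
                0,
            )
            b = unpaired.pop(idx)
            played.add(frozenset((a, b)))
            scores[prefer(a, b)] += 1.0
        # With an odd number of documents, the leftover one sits out
        # this round and its score is unchanged (an assumed bye rule).
    return scores


# Toy judge: a fixed, made-up quality table replaces the LLM call.
toy_quality = {"d1": 1, "d2": 2, "d3": 3, "d4": 4}
toy_scores = swiss_tournament(
    list(toy_quality),
    prefer=lambda a, b: a if toy_quality[a] >= toy_quality[b] else b,
    rounds=2,
)
```

A Swiss format needs far fewer comparisons than a full round-robin, because later rounds spend judge calls on documents that are already close in score, which is where fine-grained distinctions matter most.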

Overall, the study argues that math-specific retrieval benchmarks are necessary, since general-purpose evaluations do not accurately reflect performance on complex mathematical tasks.
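
One common way to check whether a general-purpose benchmark predicts performance on a specialized one is a rank correlation between the two sets of per-model scores. The summary does not say which statistic the authors used, so the sketch below is only an illustration: it computes Kendall's tau (the tau-a variant, which does not correct for ties), and the function name `kendall_tau` is an assumption.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Kendall's tau-a between two equally long score lists.

    xs[i] and ys[i] are the scores of the same model on two benchmarks.
    Returns a value in [-1, 1]; 1 means both benchmarks rank the models
    identically, -1 means the rankings are reversed.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")

    concordant = 0
    discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        # The pair is concordant when both benchmarks order it the same way.
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
        # Tied pairs count toward neither total (tau-a convention).

    total_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / total_pairs
```

Passing one general-benchmark score and one math-benchmark score per model gives a single number: a value near 1 would mean the general benchmark is a good proxy, while a value near 0 would match the finding reported here that it is not.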