Researchers introduce SABER-Math, the first fully automated benchmark for evaluating mathematical information retrieval without expert annotation. It targets a known difficulty: a retriever's contribution is hard to isolate when only downstream task performance is measured.
- The benchmark draws on 283K high-school-level math problems, building challenging reranking tasks from LLM-extracted summaries and ontology-based similarities.
- A Swiss-style LLM preference tournament generates fine-grained relevance ratings for documents within these tasks.
- Evaluation reveals that modern embedding models outperform classical baselines but struggle in symbol-heavy domains like Algebra and Calculus.
- General-purpose benchmarks such as MTEB fail to reliably predict mathematical performance, highlighting the need for specialized evaluation tools.
Taken together, the results make the case for math-specific retrieval benchmarks: scores on existing general-purpose evaluations do not accurately reflect performance on complex mathematical tasks.
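
To make the Swiss-style preference tournament mentioned above more concrete, here is a minimal sketch of how such a tournament can be run over the candidate documents of one reranking task. It is an illustration under stated assumptions, not the benchmark's actual procedure: the names `swiss_tournament`, `toy_judge`, and `TOY_RELEVANCE` are hypothetical, the judge is a deterministic stand-in for an LLM preference call, and using the raw win count as a relevance signal is an assumption about how pairwise preferences might be turned into graded ratings.

```python
from typing import Callable, Dict, List


def swiss_tournament(
    docs: List[str],
    prefer: Callable[[str, str], str],
    rounds: int = 3,
) -> Dict[str, int]:
    """Return the number of pairwise wins per document after `rounds` rounds.

    `docs` must contain unique identifiers. `prefer(a, b)` must return
    either `a` or `b` (in the real setting, an LLM preference judgment).
    """
    wins = {d: 0 for d in docs}
    played = set()  # unordered pairs that have already been compared

    for _ in range(rounds):
        # Swiss pairing: order by current score so that documents with
        # similar records meet. sorted() is stable, so ties keep input order.
        unpaired = sorted(docs, key=lambda d: -wins[d])
        while len(unpaired) >= 2:
            a = unpaired.pop(0)
            # Pick the closest-ranked opponent that `a` has not faced yet;
            # fall back to the closest-ranked one if all were already faced.
            idx = next(
                (i for i, b in enumerate(unpaired)
                 if frozenset((a, b)) not in played),
                0,
            )
            b = unpaired.pop(idx)
            played.add(frozenset((a, b)))
            wins[prefer(a, b)] += 1
        # With an odd number of documents, the leftover one sits out
        # this round and receives no point.

    return wins


# Hypothetical hidden relevance used only by the toy judge below.
TOY_RELEVANCE = {"d1": 3, "d2": 1, "d3": 4, "d4": 2}


def toy_judge(a: str, b: str) -> str:
    """Stand-in for an LLM call: prefer the document with higher toy relevance."""
    return a if TOY_RELEVANCE[a] >= TOY_RELEVANCE[b] else b


if __name__ == "__main__":
    scores = swiss_tournament(list(TOY_RELEVANCE), toy_judge, rounds=2)
    for doc, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(doc, score)
```

The appeal of the Swiss format in this setting is that each round costs only about n/2 comparisons for n documents, while pairing documents with similar records concentrates the judge's effort on the close calls, where fine-grained distinctions matter most.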