Are We Measuring Strategy or Phrasing? The Gap Between Surface- and Approach-Level Diversity in LLM Math Reasoning

This study introduces approach-level diversity to capture the gap between surface-level variation and actual strategic differences in LLM mathematical reasoning. It shows that prior metrics fail to capture true methodological diversity, and that approach-level diversity declines during diversity-aware RLVR (reinforcement learning with verifiable rewards) training even as those metrics are preserved.

The authors introduce approach-level diversity as variation in strategies across correct solutions to the same problem.
A human-calibrated LLM judge framework reveals that prior diversity measures are unreliable proxies for approach-level diversity.
Diversity-aware RLVR preserves target metrics while causing approach-level diversity to decline.
Approach-diverse candidate sets improve test-time scaling performance.
Optimizing an LLM judge diversity reward during training causes the policy to exploit judge-specific preferences rather than broadening approaches.
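
To make the distinction in the findings above concrete, here is a minimal sketch of how surface-level and approach-level counts can disagree on the same set of solutions. It is an illustration only, not the authors' metric: the function names, the dictionary keys (`text`, `approach`, `correct`), and the toy samples are all hypothetical, and the approach labels are assumed to have already been assigned by some judge.

```python
def approach_level_diversity(samples):
    """Number of distinct approach labels among correct samples.

    `samples` is a list of dicts with hypothetical keys:
      "text"     - the solution text
      "approach" - a strategy label from a judge (assumed to be given)
      "correct"  - whether the final answer was verified as correct
    """
    correct = [s for s in samples if s["correct"]]
    return len({s["approach"] for s in correct})


def surface_level_diversity(samples):
    """Crude surface proxy: number of distinct texts among correct samples."""
    correct = [s for s in samples if s["correct"]]
    return len({s["text"] for s in correct})


# Toy data (invented for illustration): two rewordings of one strategy,
# one genuinely different strategy, and one incorrect attempt.
samples = [
    {"text": "Complete the square, so x = 3.",
     "approach": "complete_square", "correct": True},
    {"text": "Completing the square gives x = 3.",
     "approach": "complete_square", "correct": True},
    {"text": "Apply the quadratic formula: x = 3.",
     "approach": "quadratic_formula", "correct": True},
    {"text": "Guess x = 4.",
     "approach": "guess", "correct": False},
]

print(surface_level_diversity(samples))   # 3 distinct wordings
print(approach_level_diversity(samples))  # 2 distinct strategies
```

In this toy case the surface proxy reports three distinct solutions while only two strategies are present, which is the kind of divergence the summary describes. Incorrect samples are excluded because the definition above applies only to correct solutions.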

The work marks a step toward LLMs that reason in genuinely diverse, human-like ways by uncovering a systematic divergence between surface- and approach-level signals.