This study introduces approach-level diversity to address the gap between surface-level variation and actual strategic differences in LLM mathematical reasoning. It shows that prior diversity metrics fail to capture methodological diversity: under diversity-aware RLVR (reinforcement learning with verifiable rewards) training, the targeted metrics are preserved while approach-level diversity declines.
- The authors define approach-level diversity as variation in strategies across correct solutions to the same problem.
- A human-calibrated LLM judge framework reveals that prior diversity measures are unreliable proxies for approach-level diversity.
- Diversity-aware RLVR preserves target metrics while causing approach-level diversity to decline.
- Approach-diverse candidate sets improve test-time scaling performance.
- Optimizing an LLM judge diversity reward during training causes the policy to exploit judge-specific preferences rather than broadening approaches.
By uncovering a systematic divergence between surface-level and approach-level signals, the work marks a step toward LLMs that reason in genuinely diverse, human-like ways.
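
To make the definition of approach-level diversity concrete (variation in strategies across correct solutions to the same problem), here is a minimal sketch of how such a quantity could be computed once each solution has an approach label. This is not the authors' metric or judge framework: the function name `approach_diversity`, the two statistics it returns, and the example labels are all illustrative assumptions, and the labels are presumed to come from some external annotator such as a human or an LLM judge.

```python
from collections import Counter


def approach_diversity(solutions):
    """Illustrative only: summarize approach variation for one problem.

    `solutions` is a list of (approach_label, is_correct) pairs.
    The labels are assumed to be supplied externally (hypothetical judge).
    Returns (number of distinct approaches among correct solutions,
             fraction of ordered pairs of correct solutions whose
             approach labels differ).
    """
    # Only correct solutions count toward approach-level diversity.
    labels = [label for label, is_correct in solutions if is_correct]
    counts = Counter(labels)
    n = len(labels)
    distinct = len(counts)

    # With fewer than two correct solutions there are no pairs to compare.
    if n < 2:
        return distinct, 0.0

    # Ordered pairs that share a label: c * (c - 1) for each label count c.
    same_pairs = sum(c * (c - 1) for c in counts.values())
    pairwise = 1.0 - same_pairs / (n * (n - 1))
    return distinct, pairwise


# Hypothetical labels for four sampled solutions to one problem.
samples = [
    ("induction", True),
    ("induction", True),
    ("generating_function", True),
    ("casework", False),  # incorrect, so it is ignored
]
distinct, pairwise = approach_diversity(samples)
print(distinct, round(pairwise, 3))
```

Filtering on correctness first reflects the definition's restriction to correct solutions; a set of samples that differ only in wording would all receive the same label and score zero on the pairwise statistic, however varied their surface form.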