VeriEvol: Scaling Multimodal Mathematical Reasoning via Verifiable Evol-Instruct

VeriEvol is an iterative framework designed to scale multimodal mathematical reasoning by decoupling prompt difficulty from answer reliability during data construction. It employs a type-aware evolution module to generate harder prompts and the HTV-Agent verifier to ensure answer correctness through multi-source counter-evidence.
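The split between making a prompt harder and deciding whether its answer can be trusted can be sketched as two independent steps. The following is a minimal illustration only, not the authors' implementation: the `Sample` fields, the route names, the operator and refuter functions, and the toy data are all hypothetical stand-ins.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Sample:
    image_id: str
    question: str
    route: str                       # hypothetical problem type, e.g. "geometry"
    answer: Optional[str] = None
    trace: List[str] = field(default_factory=list)


# An evolution operator rewrites a question into a harder one (assumed signature).
Operator = Callable[[str], str]
# A refuter returns True when it finds evidence AGAINST the answer (assumed signature).
Refuter = Callable[[Sample], bool]


def evolve(seed: Sample, operators: Dict[str, Operator]) -> Sample:
    """Pick the operator for the seed's route and return a harder prompt."""
    op = operators[seed.route]
    return Sample(seed.image_id, op(seed.question), seed.route)


def verify(sample: Sample, refuters: Dict[str, Refuter]) -> bool:
    """Accept the answer only if no evidence source refutes it; log every check."""
    refuted = False
    for name, refuter in refuters.items():
        hit = refuter(sample)
        sample.trace.append(f"{name}: {'refuted' if hit else 'no refutation'}")
        refuted = refuted or hit
    return not refuted


# Toy stand-ins; real operators and evidence sources would call models or tools.
operators = {
    "geometry": lambda q: q + " Then give the area of the triangle formed by the midpoints.",
    "chart": lambda q: q + " Then report the difference between the two largest bars.",
}
refuters = {
    "recompute": lambda s: s.answer != "6",                       # stand-in for an independent re-solve
    "format": lambda s: s.answer is None or not s.answer.strip(),  # stand-in for a sanity check
}

seed = Sample("img_001", "What is the area of triangle ABC?", "geometry")
harder = evolve(seed, operators)
harder.answer = "6"                  # candidate answer from some solver (assumed)
accepted = verify(harder, refuters)
```

Because `evolve` never looks at answers and `verify` never changes prompts, either step can be tuned or audited without touching the other, and `trace` keeps a per-sample record of what each check concluded.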

The type-aware evolution module rewrites low-difficulty image-question seeds into harder, image-grounded prompts using route-specific operators.
The HTV-Agent verifier accepts answers only after multi-source counter-evidence fails to refute them.
Scaling evolved SFT data from 10K to 250K samples raises mean accuracy on a five-benchmark suite from 35.42 to 54.73.
VeriEvol adds a cumulative +3.88 over an un-evolved RL baseline: +1.82 from evolved prompts and +2.06 from the verifier.
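The reported figures can be checked with a few lines of arithmetic. The snippet below uses only the numbers quoted above; the variable names are illustrative, and `Decimal` is used so the two-decimal values are compared exactly.

```python
from decimal import Decimal

# Mean accuracy on the five-benchmark suite, as quoted in the text.
sft_10k = Decimal("35.42")
sft_250k = Decimal("54.73")
sft_gain = sft_250k - sft_10k

# RL gains over the un-evolved baseline, as quoted in the text.
prompt_gain = Decimal("1.82")
verifier_gain = Decimal("2.06")
rl_total = prompt_gain + verifier_gain

print(f"SFT scaling gain: +{sft_gain}")    # +19.31
print(f"RL cumulative gain: +{rl_total}")  # +3.88
```

The two RL components sum to the stated cumulative gain, and the verifier accounts for slightly more of it than the evolved prompts do.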

To let downstream work scale and audit the data pipeline, the release includes prompts, data, models, code, and full verifier traces for every sample.