VeriEvol is an iterative framework that scales multimodal mathematical reasoning by decoupling prompt difficulty from answer reliability during data construction. A type-aware evolution module generates harder prompts, and the HTV-Agent verifier checks answer correctness against multi-source counter-evidence.
- The type-aware evolution module rewrites low-difficulty image-question seeds into harder, image-grounded prompts using route-specific operators.
- The HTV-Agent verifier accepts answers only after multi-source counter-evidence fails to refute them.
- Scaling evolved SFT data from 10K to 250K samples raises mean accuracy on a five-benchmark suite from 35.42 to 54.73.
- VeriEvol adds a cumulative +3.88 over an un-evolved RL baseline: +1.82 from evolved prompts and +2.06 from the verifier.
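The verifier's acceptance rule in the list above (keep an answer only when every counter-evidence source fails to refute it) can be illustrated with a minimal sketch. This is not the HTV-Agent implementation: `Sample`, `accept`, `filter_verified`, the two toy refuters, and the `RECHECK` table are all hypothetical names and data introduced here for illustration, and the real verifier's evidence sources are not specified in this summary.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Sample:
    """Hypothetical container for one prompt and its candidate answer."""
    question: str
    answer: str


# A refuter returns True when it finds evidence AGAINST the answer.
Refuter = Callable[[Sample], bool]


def accept(sample: Sample, refuters: Iterable[Refuter]) -> bool:
    """Accept only if no counter-evidence source refutes the answer."""
    return not any(refutes(sample) for refutes in refuters)


def filter_verified(samples: Iterable[Sample],
                    refuters: Iterable[Refuter]) -> List[Sample]:
    """Keep the samples whose answers survive every refuter."""
    refuters = list(refuters)  # allow the iterable to be reused per sample
    return [s for s in samples if accept(s, refuters)]


# Toy stand-in for an independent re-solve (assumed data, illustration only).
RECHECK = {"What is 7 * 8?": "56", "What is 9 + 4?": "13"}


def empty_answer(sample: Sample) -> bool:
    """Refute answers that contain no content."""
    return not sample.answer.strip()


def disagrees_with_recheck(sample: Sample) -> bool:
    """Refute answers that contradict the independent re-solve, if one exists."""
    expected = RECHECK.get(sample.question)
    return expected is not None and expected != sample.answer.strip()


candidates = [
    Sample("What is 7 * 8?", "56"),   # survives both refuters
    Sample("What is 9 + 4?", "12"),   # contradicted by the re-solve
    Sample("What is 7 * 8?", " "),    # empty answer
]
verified = filter_verified(candidates, [empty_answer, disagrees_with_recheck])
```

In this sketch a sample is rejected as soon as any single source refutes it, so adding refuters can only shrink the accepted set. That mirrors the trade-off the summary describes: reliability comes from the verifier, independently of how hard the evolved prompt is.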
Prompts, data, models, code, and full verifier traces for every sample are released so that downstream work can scale and audit the data pipeline.