CFPO: Counterfactual Policy Optimization for Multimodal Reasoning
CFPO introduces a cross-modal counterfactual enhancement mechanism that improves causal consistency between visual perception and textual reasoning in vision-language models. It achieves gains of 3.17% to 6.25% over standard RL baselines and 1.32% to 2.13% over PAPO, without requiring external rewards or additional supervision.