VIMPO is a critic-free policy optimization method that derives a policy-implied value function from KL-regularized reinforcement learning. This lets it incorporate verifiable rewards without training a critic, and it outperforms GRPO on mathematical benchmarks, especially under noisy rewards.
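To illustrate what a "policy-implied value function" can mean, here is a minimal sketch of the textbook single-step KL-regularized identity. It is not taken from the VIMPO paper and may differ from the paper's actual derivation. For the objective E[r] - beta * KL(pi || pi_ref), the optimal policy satisfies pi*(a) = pi_ref(a) * exp((r(a) - V) / beta), so V can be read back from the policy as r(a) - beta * log(pi*(a) / pi_ref(a)) without a separate critic. The function names, rewards, reference probabilities, and beta below are hypothetical values chosen for illustration.

```python
import math

def soft_value(rewards, ref_probs, beta):
    """Soft value V = beta * log sum_a pi_ref(a) * exp(r(a) / beta), computed stably."""
    scaled = [math.log(p) + r / beta for r, p in zip(rewards, ref_probs)]
    m = max(scaled)
    return beta * (m + math.log(sum(math.exp(s - m) for s in scaled)))

def optimal_policy(rewards, ref_probs, beta):
    """Optimal KL-regularized policy: pi*(a) = pi_ref(a) * exp((r(a) - V) / beta)."""
    v = soft_value(rewards, ref_probs, beta)
    return [p * math.exp((r - v) / beta) for r, p in zip(rewards, ref_probs)]

def implied_value(reward, prob, ref_prob, beta):
    """Value recovered from the policy alone: V = r(a) - beta * log(pi(a) / pi_ref(a))."""
    return reward - beta * math.log(prob / ref_prob)

# Hypothetical toy problem: three candidate answers, only the first is verified correct.
rewards = [1.0, 0.0, 0.0]
ref_probs = [0.2, 0.5, 0.3]
beta = 0.5  # assumed KL-regularization strength

policy = optimal_policy(rewards, ref_probs, beta)
value = soft_value(rewards, ref_probs, beta)
implied = [implied_value(r, p, q, beta) for r, p, q in zip(rewards, policy, ref_probs)]
```

At the optimum, every entry of `implied` equals `value`, which is why the policy's own log-ratio to the reference can stand in for a learned critic in this simplified setting.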
arXiv cs.LG · 6d ago · research
VIMPO: Critic-Free Policy Optimization for LLMs
Importance 3/3
Beats a top-lab benchmark
New feature vs. leaders
OpenAI
Google DeepMind
Meta AI
Evaluation & benchmarks
Reasoning models
Training methods