Adversarial framework augments RLVR with human demonstration signals

The authors propose an adversarial generator-discriminator framework that enhances Reinforcement Learning with Verifiable Rewards (RLVR) by incorporating learned signals from human demonstrations to address issues like diversity collapse and unnatural outputs.

The generator maximizes task accuracy alongside an adversarial reward derived from a discriminator trained to distinguish human-written outputs from model-generated ones.
This approach improves non-verifiable properties across domains, for example lower edit distance in bug fixing and higher win rates in story generation, while preserving the accuracy gains of RLVR.
The method also nearly eliminates model misbehavior on reward hacking benchmarks while maintaining high scores, bridging RL and Supervised Fine-Tuning (SFT) by learning from verifiable rewards and human demonstrations together.
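
The generator and discriminator objectives described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the summary does not say how the two reward terms are combined, so the additive form, the `adv_weight` coefficient, the use of a sigmoid over a discriminator logit, and all function names are assumptions made for the example.

```python
import math


def sigmoid(x: float) -> float:
    """Map a raw discriminator logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def discriminator_loss(human_logits, model_logits):
    """Binary cross-entropy for the discriminator (assumed form).

    Human-written outputs are labelled 1 and model-generated outputs 0,
    so the loss is low when the discriminator tells them apart.
    """
    eps = 1e-12  # guards against log(0)
    human_terms = [-math.log(sigmoid(z) + eps) for z in human_logits]
    model_terms = [-math.log(1.0 - sigmoid(z) + eps) for z in model_logits]
    terms = human_terms + model_terms
    return sum(terms) / len(terms)


def generator_reward(is_correct: bool, disc_logit: float,
                     adv_weight: float = 0.5) -> float:
    """Verifiable task reward plus a weighted adversarial reward.

    The adversarial term is the discriminator's estimated probability
    that the output is human-written. adv_weight is a hypothetical
    hyperparameter, not a value taken from the paper.
    """
    task_reward = 1.0 if is_correct else 0.0
    adversarial_reward = sigmoid(disc_logit)
    return task_reward + adv_weight * adversarial_reward
```

Under these assumptions, a correct output that the discriminator finds human-like earns more than a correct but unnatural one, which is how the adversarial term is meant to steer the generator toward properties the verifier cannot check.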

This approach offers a scalable path toward jointly optimizing the verifiable and non-verifiable properties of a task.