The authors propose an adversarial generator-discriminator framework that enhances Reinforcement Learning with Verifiable Rewards (RLVR) by incorporating learned signals from human demonstrations to address issues like diversity collapse and unnatural outputs.

  • The generator maximizes task accuracy alongside an adversarial reward derived from a discriminator trained to distinguish human-written outputs from model-generated ones.
  • This approach improves non-verifiable properties across domains, such as lower edit distance in bug fixing and higher win rates in story generation, while preserving the accuracy gains of RLVR.
  • On reward hacking benchmarks, the method nearly eliminates model misbehavior while maintaining high scores, and in doing so it bridges RL and Supervised Fine-Tuning (SFT).
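
The reward structure described in the bullets above can be illustrated with a minimal sketch. The summary does not give the exact formulation, so everything here is an assumption made for illustration: the additive combination, the weight `adv_weight`, the use of the discriminator's "human-written" probability as the adversarial reward, and the function names themselves are hypothetical and not taken from the authors' method.

```python
import math


def combined_reward(task_reward: float, p_human: float, adv_weight: float = 0.5) -> float:
    """Generator reward: verifiable task reward plus a weighted adversarial term.

    Assumption: the two signals are simply added, and the adversarial reward is
    the discriminator's probability that the output is human-written.
    """
    return task_reward + adv_weight * p_human


def discriminator_loss(p_on_human: float, p_on_model: float) -> float:
    """Binary cross-entropy for one human sample and one model sample.

    p_on_human: discriminator's "human" probability for a human-written output.
    p_on_model: discriminator's "human" probability for a model-generated output.
    """
    eps = 1e-8  # clamp to avoid log(0)
    ph = min(max(p_on_human, eps), 1.0 - eps)
    pm = min(max(p_on_model, eps), 1.0 - eps)
    return -(math.log(ph) + math.log(1.0 - pm))


if __name__ == "__main__":
    # A correct output that also looks human-written earns more than a
    # correct output the discriminator flags as model-generated.
    print(combined_reward(task_reward=1.0, p_human=0.8))
    print(combined_reward(task_reward=1.0, p_human=0.1))
```

Under these assumptions, two outputs that both pass the verifier are still ranked by how human-like the discriminator finds them, which is one way a learned signal could counteract unnatural outputs without reducing the verifiable reward.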

This approach offers a scalable path toward jointly optimizing the verifiable and non-verifiable properties of a task.