The paper introduces LatentRevise, a first-order latent revision method that recovers training signal in reinforcement learning with verifiable rewards (RLVR) for prompts where correct trajectories are rarely sampled. It optimizes the input embeddings of a reasoning prefix using failed rollouts and the gold answer, so that attempts which would otherwise yield no learning signal become usable training trajectories.
- LatentRevise optimizes the input embeddings of a reasoning prefix using two complementary gradients to move away from failed continuations and toward the gold answer.
- Updates are constrained to the convex hull of the model's vocabulary embeddings, so each revised embedding stays a mixture of real token embeddings rather than drifting along arbitrary feature directions.
- Continuations generated from revised prefixes show self-reflection, run longer, and reach correct answers that the original rollouts missed.
- Using these revised trajectories as training data improves supervised fine-tuning (SFT) and RLVR performance on math benchmarks compared to standard baselines.
This approach addresses the hard-prompt bottleneck in RLVR by turning failed rollouts into informative training signal, which in turn improves the model's reasoning on mathematical tasks.
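
The constrained update described in the bullets above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation. It assumes that each prefix position is stored as mixing weights over the vocabulary embeddings, that the two gradients (of the gold-answer log-likelihood and of the failed-continuation log-likelihood, both with respect to the prefix embedding) have already been obtained by backpropagation through the model, and that the convex-hull constraint is enforced by a Euclidean projection onto the probability simplex. The paper may enforce the constraint differently. The names `project_to_simplex`, `mix`, and `revise_step` are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def project_to_simplex(v: Sequence[float]) -> Vector:
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        t = (cumulative - 1.0) / k
        if u_k - t > 0:
            theta = t  # the last k that passes the check fixes the threshold
    return [max(x - theta, 0.0) for x in v]


def mix(weights: Sequence[float], vocab: Sequence[Sequence[float]]) -> Vector:
    """Embedding x = sum_i w_i * e_i, a point in the convex hull of vocab."""
    dim = len(vocab[0])
    return [sum(w * e[j] for w, e in zip(weights, vocab)) for j in range(dim)]


def revise_step(
    weights: Sequence[float],
    vocab: Sequence[Sequence[float]],
    grad_gold: Sequence[float],
    grad_fail: Sequence[float],
    lr: float = 0.1,
) -> Vector:
    """One first-order revision step for a single prefix position.

    Assumption: grad_gold and grad_fail are gradients of the gold-answer and
    failed-continuation log-likelihoods with respect to the prefix embedding.
    """
    # Move toward the gold answer and away from the failed continuation.
    direction = [g - f for g, f in zip(grad_gold, grad_fail)]
    # Chain rule through x = sum_i w_i * e_i: d/dw_i = <e_i, direction>.
    grad_w = [sum(e_j * d_j for e_j, d_j in zip(e, direction)) for e in vocab]
    stepped = [w + lr * g for w, g in zip(weights, grad_w)]
    # Projecting the weights keeps the embedding inside the convex hull.
    return project_to_simplex(stepped)


# Toy usage: three "token" embeddings in 2-D, starting from a uniform mixture.
vocab = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
weights = [1.0 / 3.0] * 3
new_weights = revise_step(weights, vocab, grad_gold=[1.0, 0.0], grad_fail=[0.0, 1.0])
revised_embedding = mix(new_weights, vocab)
```

Because the step is taken in weight space and then projected, the revised embedding is always a convex combination of vocabulary embeddings, which is one simple way to realize the constraint in the second bullet.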