LatentRevise: Learning from Zero-Hit Reasoning
The paper introduces LatentRevise, a first-order latent revision method for recovering training signal in reinforcement learning with verifiable rewards (RLVR) on prompts where correct trajectories are rarely sampled. It optimizes the input embeddings of a reasoning prefix using failed rollouts and gold answers, turning previously unproductive attempts into usable training data.
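
To make the idea of a first-order revision of prefix embeddings concrete, the sketch below runs plain gradient descent on the embeddings of a reasoning prefix so that a frozen toy model assigns higher likelihood to the gold answer. It is an illustration under stated assumptions, not the paper's implementation: the linear "model" `W`, the mean-pooling of the prefix, the negative log-likelihood objective, and the names `gold_nll_and_grad` and `revise_prefix` are all hypothetical and chosen only to keep the example self-contained.

```python
import math


def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gold_nll_and_grad(prefix_emb, W, gold):
    """Loss and gradient for a toy frozen model: logits = W @ mean(prefix_emb).

    Assumption: the objective is the negative log-likelihood of the gold
    answer index; the paper's actual objective may differ.
    """
    n, d = len(prefix_emb), len(prefix_emb[0])
    mean = [sum(row[j] for row in prefix_emb) / n for j in range(d)]
    logits = [sum(W[k][j] * mean[j] for j in range(d)) for k in range(len(W))]
    probs = softmax(logits)
    loss = -math.log(probs[gold])
    # d loss / d logits = probs - onehot(gold)
    dlogits = [p - (1.0 if k == gold else 0.0) for k, p in enumerate(probs)]
    # d loss / d mean = W^T dlogits; each prefix row receives 1/n of it.
    dmean = [sum(W[k][j] * dlogits[k] for k in range(len(W))) for j in range(d)]
    grad = [[g / n for g in dmean] for _ in range(n)]
    return loss, grad


def revise_prefix(prefix_emb, W, gold, lr=0.5, steps=20):
    """First-order update of the prefix embeddings only; W stays frozen."""
    emb = [row[:] for row in prefix_emb]
    losses = []
    for _ in range(steps):
        loss, grad = gold_nll_and_grad(emb, W, gold)
        losses.append(loss)
        emb = [[e - lr * g for e, g in zip(row, grow)]
               for row, grow in zip(emb, grad)]
    final_loss, _ = gold_nll_and_grad(emb, W, gold)
    losses.append(final_loss)
    return emb, losses


# Made-up numbers: a 3-answer toy model and a 2-token prefix from a failed rollout.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
prefix = [[0.2, 0.9], [0.1, 0.7]]
revised, losses = revise_prefix(prefix, W, gold=0)
```

Comparing `losses[0]` with `losses[-1]` shows how much the gold answer's likelihood improved after revising only the prefix embeddings. In a real setting the toy model would be replaced by a frozen language model and the gradient would come from automatic differentiation rather than the hand-derived formula used here.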