Reversal Q-Learning: A New Off-Policy RL Algorithm

Reversal Q-Learning (RQL) is a new off-policy reinforcement learning algorithm that trains a flow policy using prior data. By modeling flow refinement steps as actions in an expanded Markov decision process and applying virtual on-policy trajectories via reversal, RQL enables effective offline learning without backpropagation through time. Experiments on 50 robotic tasks show RQL achieves the best average performance among state-of-the-art flow-based offline RL methods.