UBP2 is a model-based method that actively explores an environment by jointly reasoning over uncertainty in the reward, dynamics, and value functions. On the Meta-World benchmark, it is more sample-efficient in preference-based reinforcement learning than both model-free and non-optimistic model-based baselines.
UBP2: Uncertainty-Balanced Preference Planning for Efficient Preference-based RL