Pareto Q-Learning with Reward Machines

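Pareto Q-Learning with Reward Machines (PQLRM) is a multi-objective reinforcement learning algorithm that combines Pareto Q-Learning (PQL) with Reward Machines to handle non-Markovian rewards. It converges faster than naive PQL applied to the cross-product MDP, in which each environment state is paired with a reward machine state. It also generates Pareto-optimal policies that QRM (Q-learning for Reward Machines), which learns scalar Q-values, cannot produce.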
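
The two ingredients named above can be illustrated with a minimal sketch. It is not the PQLRM implementation: it only shows (1) the Pareto-dominance filter that Pareto Q-Learning relies on to keep sets of non-dominated value vectors, and (2) a toy reward machine whose transitions emit vector rewards. The names dominates, pareto_front, TOY_RM and rm_step, the event labels "key" and "door", and all reward values are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

Vector = Tuple[float, ...]


def dominates(a: Vector, b: Vector) -> bool:
    """True if a is at least as good as b in every objective and strictly better in one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points: List[Vector]) -> List[Vector]:
    """Keep only the non-dominated vectors (duplicates collapsed, order preserved)."""
    unique = list(dict.fromkeys(points))
    return [p for p in unique if not any(dominates(q, p) for q in unique)]


# Hypothetical reward machine (an assumption, not taken from the source):
# (RM state, event label) -> (next RM state, reward vector).
# Objective 0 is task reward, objective 1 is a per-step cost.
TOY_RM: Dict[Tuple[str, str], Tuple[str, Vector]] = {
    ("u0", "key"): ("u1", (0.0, -1.0)),
    ("u1", "door"): ("u2", (10.0, -1.0)),
}


def rm_step(u: str, label: str) -> Tuple[str, Vector]:
    """Advance the reward machine; unlisted events keep the state and pay only the step cost."""
    return TOY_RM.get((u, label), (u, (0.0, -1.0)))
```

In this toy machine the "door" event pays off only after "key" has been observed, which is what makes the reward non-Markovian in the environment state alone. Tracking the machine state u alongside the environment state restores the Markov property, and pareto_front is the kind of pruning step a multi-objective learner would apply to candidate value vectors in each such pair.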