Wasserstein Policy Learning for Distributional Outcomes
This paper introduces offline policy learning for distribution-valued outcomes, where rewards are derived from utility functionals applied to Wasserstein barycenters. It establishes statistical guarantees for inverse propensity weighting (IPW) and doubly robust (DR) estimators, proving finite-sample regret bounds with leading dependence $\widetilde{\mathcal{O}}(\sqrt{\mathrm{N\text{-}dim}(\Pi)/N})$, and provides a minimax lower bound confirming that this rate is sharp.
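
To make the IPW construction concrete, the following is a minimal sketch, not the paper's estimator. It assumes one-dimensional outcome distributions, each represented by its quantile values on a shared grid, and uses the fact that in one dimension the 2-Wasserstein barycenter is the weighted average of the quantile functions. The function names, the self-normalized weighting, and the choice of the mean as the utility functional are all hypothetical, chosen for illustration only.

```python
def ipw_barycenter_quantiles(quantiles, actions, propensities, policy_actions):
    """Self-normalized IPW estimate of a 1-D W2 barycenter under a target policy.

    quantiles[i] holds the quantile values of logged outcome distribution i on a
    shared grid; actions[i] is the logged action, propensities[i] the logging
    probability of that action, policy_actions[i] the target policy's action.
    (Illustrative assumption: this is not the paper's estimator.)
    """
    # IPW weight: 1/propensity when the logged action agrees with the policy, else 0.
    weights = [
        (1.0 / p) if a == pa else 0.0
        for a, p, pa in zip(actions, propensities, policy_actions)
    ]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no logged action agrees with the target policy")
    grid_size = len(quantiles[0])
    # In 1-D, the W2 barycenter's quantile function is the weighted average
    # of the individual quantile functions.
    return [
        sum(w * q[k] for w, q in zip(weights, quantiles)) / total
        for k in range(grid_size)
    ]


def mean_utility(bary_quantiles):
    """Hypothetical utility functional: mean of the barycenter on a uniform grid."""
    return sum(bary_quantiles) / len(bary_quantiles)


# Toy logged data: four units, two actions, uniform logging propensities.
quantiles = [[0, 1, 2], [10, 11, 12], [5, 5, 5], [2, 3, 4]]
actions = [0, 1, 0, 1]
propensities = [0.5, 0.5, 0.5, 0.5]
policy_actions = [0, 0, 1, 1]  # actions the candidate policy would take

bary = ipw_barycenter_quantiles(quantiles, actions, propensities, policy_actions)
value = mean_utility(bary)
```

In this toy data only the first and last units agree with the candidate policy, so the estimated barycenter averages their two quantile vectors. A DR variant would add an outcome-model term to each unit's contribution, which this sketch leaves out.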