Off-Policy Evaluation for MNAR Rewards in MDPs
We propose an off-policy evaluation method for finite-horizon Markov decision processes (MDPs) in which rewards are missing not at random (MNAR). The approach combines a reward-dependent propensity model with a bridge function to recover conditional mean rewards without modeling the MNAR mechanism. The resulting estimator is consistent and admits finite-sample error bounds. In experiments on simulated data and on MIMIC-III sepsis data, it outperforms existing methods.
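
To illustrate why rewards that are missing not at random matter for evaluation, the following minimal sketch simulates binary rewards whose chance of being observed depends on the reward value itself. It is not the proposed estimator: the reward rate, the observation probabilities, and the function name `simulate` are hypothetical, and the inverse-propensity correction assumes the reward-dependent propensity is known, which does not hold in practice.

```python
import random


def simulate(n=200_000, p_reward=0.3, seed=0):
    """Compare a complete-case mean with an inverse-propensity-weighted mean.

    All numeric settings here are illustrative assumptions, not values
    taken from the paper.
    """
    rng = random.Random(seed)
    # Hypothetical reward-dependent observation probabilities:
    # a reward of 1 is recorded far more often than a reward of 0.
    obs_prob = {0: 0.3, 1: 0.9}

    observed_sum = 0.0
    observed_count = 0
    ipw_sum = 0.0
    for _ in range(n):
        r = 1 if rng.random() < p_reward else 0
        if rng.random() < obs_prob[r]:  # reward is observed
            observed_sum += r
            observed_count += 1
            # Weight each observed reward by 1 / P(observed | reward).
            ipw_sum += r / obs_prob[r]

    naive = observed_sum / observed_count  # complete-case average
    ipw = ipw_sum / n                      # uses the (assumed known) propensity
    return naive, ipw


naive_mean, ipw_mean = simulate()
print(f"complete-case mean: {naive_mean:.3f}")
print(f"inverse-propensity mean: {ipw_mean:.3f}")
```

Under these assumed settings the complete-case mean concentrates near 0.27 / 0.48 = 0.5625 rather than the true mean reward of 0.3, while the weighted mean concentrates near 0.3. The weights require a propensity that depends on the unobserved reward, so they cannot be estimated directly from the observed data alone.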