Researchers propose Self-Aware Scheduling (SAS), a method that learns the order in which a masked diffusion language model (MDM) unmasks tokens, with the aim of improving generation quality. By deriving a tractable upper bound on the mismatch introduced by sequential decoding, the approach casts order selection as a policy optimization problem solved with Group Relative Policy Optimization (GRPO).

  • SAS introduces a dense self-aware reward over ordered trajectories to guide the lightweight order policy.
  • The method applies seamlessly to both any-order and semi-autoregressive decoding modes.
  • On Sudoku with a 1B MDM, accuracy improved from 82.0% to 91.8%, reaching 97.5% with second-stage fine-tuning.
  • With LLaDA-8B, pass@1 on the GSM8K math reasoning benchmark increased from 64% to 76%, and on the MBPP code generation benchmark from 39.5% to 41%.
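
To make the policy optimization step more concrete, here is a minimal sketch of the group-relative advantage that gives Group Relative Policy Optimization its name: rewards for a group of sampled trajectories are normalized against that group's own mean and standard deviation. The function name, the `eps` constant, and the example rewards are illustrative assumptions, not details from the SAS paper, and the self-aware reward itself is not reproduced here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled trajectories.

    Hypothetical helper: each reward is assumed to be a single scalar
    per trajectory (for example, a sum of per-step rewards).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps avoids division by zero when all rewards in the group are equal
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up rewards for three unmasking orders sampled for the same prompt
advantages = group_relative_advantages([1.0, 2.0, 3.0])
```

Orders that score above the group average receive a positive advantage and are reinforced; those below average are discouraged, without needing a separate learned value function.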

This approach provides a principled alternative to heuristic scheduling, consistently matching or exceeding baseline performance across various generation lengths and block sizes.
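
For contrast, a common heuristic schedule unmasks the most confident positions first, either over the whole sequence (any-order) or block by block from left to right (semi-autoregressive). The sketch below is an assumption about what such a baseline looks like, not the one evaluated in the paper; `unmasking_order`, `confidences`, and `block_size` are hypothetical names. It also simplifies by ranking a fixed set of confidences once, whereas a real decoder would recompute them after every unmasking step.

```python
def unmasking_order(confidences, block_size=None):
    """Return positions in the order a confidence heuristic would unmask them.

    block_size=None ranks the whole sequence at once (any-order);
    otherwise blocks are visited left to right and ranked internally
    (semi-autoregressive).
    """
    n = len(confidences)
    if n == 0:
        return []
    if block_size is None:
        block_size = n
    if block_size < 1:
        raise ValueError("block_size must be a positive integer")
    order = []
    for start in range(0, n, block_size):
        block = range(start, min(start + block_size, n))
        # Highest confidence first within the current block
        order.extend(sorted(block, key=lambda i: -confidences[i]))
    return order


conf = [0.2, 0.9, 0.5, 0.7]            # made-up per-position confidences
any_order = unmasking_order(conf)       # [1, 3, 2, 0]
semi_ar = unmasking_order(conf, 2)      # [1, 0, 3, 2]
```

A learned order policy replaces the fixed `sorted(...)` rule with a choice that is trained against a reward, which is the role SAS assigns to its lightweight order policy.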