Self-Aware Scheduling Learns Token Unmasking Order in Diffusion Language Models

The authors propose Self-Aware Scheduling (SAS) to optimize the order in which tokens are unmasked in masked diffusion language models, a choice that significantly affects generation quality. They derive a tractable upper bound on the sequential decoding mismatch, expressed in terms of Kullback-Leibler divergence and pathwise log-likelihood. This bound yields a dense, self-aware reward that frames order selection as a policy optimization problem with a frozen denoiser. SAS learns a lightweight order policy via Group Relative Policy Optimization (GRPO) and supports both any-order and semi-autoregressive decoding. On Sudoku with a 1B-parameter model, accuracy improved from 82.0% to 91.8%, reaching 97.5% after second-stage fine-tuning. For mathematical reasoning with LLaDA-8B, pass@1 on GSM8K increased from 64% to 76%. The method also raised MBPP scores from 39.5% to 41%, and it consistently matched or exceeded heuristic schedules across various parameter settings.
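
To make the idea of scoring an unmasking order concrete, here is a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions: the reward is taken to be the pathwise log-likelihood of a known target sequence, which is a simplification of the bound described above; `toy_denoiser` is a hypothetical two-token stand-in for the frozen denoiser; and `group_relative_advantages` uses the standard mean-and-standard-deviation normalization associated with Group Relative Policy Optimization. All function names are illustrative.

```python
import math
from typing import Callable, Dict, List, Sequence

# Hypothetical interface for a frozen denoiser: given the tokens revealed so
# far (position -> token) and a position to unmask, return a probability
# distribution over the vocabulary.
Denoiser = Callable[[Dict[int, int], int], Sequence[float]]


def pathwise_log_likelihood(
    denoiser: Denoiser, target: Sequence[int], order: Sequence[int]
) -> float:
    """Sum log p(target[pos] | tokens revealed earlier) along one order."""
    revealed: Dict[int, int] = {}
    total = 0.0
    for pos in order:
        probs = denoiser(revealed, pos)
        total += math.log(probs[target[pos]])
        revealed[pos] = target[pos]
    return total


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-8
) -> List[float]:
    """Normalize each reward by the mean and std of its sampling group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def toy_denoiser(revealed: Dict[int, int], pos: int) -> Sequence[float]:
    """Assumed toy model over a 2-token vocabulary and 2 positions.

    Position 1 is predicted well only once position 0 is revealed;
    otherwise the model is uniform. This mimics an imperfect denoiser
    whose quality depends on the unmasking order.
    """
    if pos == 1 and 0 in revealed:
        probs = [0.1, 0.1]
        probs[revealed[0]] = 0.9  # position 1 tends to copy position 0
        return probs
    return [0.5, 0.5]


if __name__ == "__main__":
    target = [1, 1]
    orders = [[0, 1], [1, 0]]
    rewards = [pathwise_log_likelihood(toy_denoiser, target, o) for o in orders]
    advantages = group_relative_advantages(rewards)
    for order, reward, adv in zip(orders, rewards, advantages):
        print(order, round(reward, 4), round(adv, 4))
```

In this toy setting the order `[0, 1]` receives the higher reward and a positive advantage, so a policy-gradient update would shift probability toward unmasking position 0 first. A real order policy would instead condition on the partially masked sequence and be trained against the actual denoiser.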

Benchmark	Model	Score
GSM8K	LLaDA-8B	76%
MBPP+	LLaDA-8B	41%
