Scheduling Thoughts: Learning the Order of Thought in Diffusion Language Models

Researchers propose Self-Aware Scheduling (SAS), a method that learns the order in which a masked diffusion language model unmasks tokens, with the goal of improving generation quality. The approach derives a tractable upper bound on sequential decoding mismatch and uses it to cast order selection as a policy optimization problem, which is solved with Group Relative Policy Optimization (GRPO).
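As a rough illustration of the group-relative step in Group Relative Policy Optimization, the sketch below standardizes the rewards of several trajectories sampled for the same prompt, so that each trajectory is scored relative to its group rather than by a learned value function. The function name and the reward values are hypothetical, and the reward definition used by SAS is not reproduced here.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled trajectories.

    Each advantage is (reward - group mean) / (group std + eps).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four unmasking orders sampled for one prompt.
advantages = group_relative_advantages([0.2, 0.5, 0.9, 0.4])
```

Orders with above-average reward receive positive advantages and are reinforced; below-average orders receive negative ones. If every trajectory in a group gets the same reward, all advantages are zero and the group contributes no learning signal.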

SAS introduces a dense self-aware reward over ordered trajectories to guide the lightweight order policy.
The method applies to both any-order and semi-autoregressive decoding modes.
On Sudoku with a 1B-parameter masked diffusion model (MDM), accuracy improved from 82.0% to 91.8%, reaching 97.5% with second-stage fine-tuning.
With LLaDA-8B, pass@1 increased from 64% to 76% on GSM8K (mathematical reasoning) and from 39.5% to 41% on MBPP (code generation).
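To make the two decoding modes concrete, here is a minimal sketch of how an unmasking order could be sampled under each. It is an illustration under stated assumptions, not the SAS policy: `sample_order` and its `scores` argument are hypothetical, the scores are held fixed instead of being recomputed after each unmasking step, and positions are drawn without replacement with probability proportional to the softmax of their scores. Semi-autoregressive decoding is modeled as finishing each fixed-size block, left to right, before moving to the next.

```python
import math
import random


def sample_order(scores, block_size=None, rng=None):
    """Sample an unmasking order over positions 0..n-1.

    scores: hypothetical per-position policy logits (assumed fixed here).
    block_size: None for any-order decoding; otherwise positions are
        unmasked block by block, left to right (semi-autoregressive).
    """
    rng = rng or random.Random()
    n = len(scores)
    size = block_size or max(n, 1)
    order = []
    for start in range(0, n, size):
        remaining = list(range(start, min(start + size, n)))
        while remaining:
            # Softmax weights over the still-masked positions in this block.
            top = max(scores[i] for i in remaining)
            weights = [math.exp(scores[i] - top) for i in remaining]
            pick = rng.choices(remaining, weights=weights, k=1)[0]
            order.append(pick)
            remaining.remove(pick)
    return order


scores = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0]
any_order = sample_order(scores, rng=random.Random(0))
semi_ar = sample_order(scores, block_size=2, rng=random.Random(0))
```

With `block_size=None` any position may be unmasked at any step. With `block_size=2` the policy only chooses the order inside each block, so the block size controls how much freedom the scheduler has.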

This approach provides a principled alternative to heuristic scheduling, consistently matching or exceeding baseline performance across various generation lengths and block sizes.

Benchmark	Model	pass@1 with SAS
GSM8K	LLaDA-8B	76%
MBPP+	LLaDA-8B	41%
