Researchers from SJTU-DENG-Lab introduce Multi-Block Diffusion Language Models (MBD-LMs), which extend Single-Block Diffusion models by decoding a running set of consecutive blocks concurrently, adding inter-block parallelism. The approach applies Multi-block Teacher Forcing (MultiTF) during post-training to close the gap between the states seen in training and those encountered at inference, and pairs it with an optimized decoding algorithm built on the Block Buffer mechanism.
- MBD-LMs use Multi-block Teacher Forcing to train on bounded noise groups, each conditioned on a clean prefix, with randomized noise schedulers.
- The Block Buffer mechanism preserves prefix-cache reuse and keeps input shapes static to translate parallelism into wall-clock acceleration.
- MBD-LLaDA2-Mini increases average Tokens Per Forward pass (TPF) from 3.47 to 6.19 and improves average accuracy from 79.95% to 81.03%.
- When combined with DMax, MBD-LLaDA2-Mini-DMax reaches an average TPF of 9.34 with only a 1.02% accuracy drop on math and code benchmarks.
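The Block Buffer idea in the list above can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the authors' algorithm: the `BlockBuffer` class, the `decode` helper, and the per-slot `budgets` (a stand-in for whatever rule decides how many tokens each block unmasks per pass) are all hypothetical, and a plain Python list stands in for the prefix cache. It only shows how a fixed-size window of consecutive blocks can slide forward while its input shape stays constant.

```python
from dataclasses import dataclass, field

MASK = None  # placeholder for a token that has not been decoded yet


@dataclass
class BlockBuffer:
    """Toy fixed-size window of consecutive blocks (hypothetical illustration)."""
    block_size: int
    budgets: tuple  # assumed: tokens each buffer slot may unmask per forward pass
    prefix: list = field(default_factory=list)  # committed tokens (stand-in for a prefix cache)
    blocks: list = field(default_factory=list)  # blocks currently being decoded

    def __post_init__(self):
        self.blocks = [[MASK] * self.block_size for _ in self.budgets]

    def input_shape(self):
        # The active window always holds len(budgets) blocks of block_size tokens.
        return (len(self.blocks), self.block_size)

    def step(self, propose):
        """Simulate one forward pass over every block in the buffer."""
        base = len(self.prefix)
        decoded = 0
        for slot, block in enumerate(self.blocks):
            budget = self.budgets[slot]
            for i, tok in enumerate(block):
                if budget == 0:
                    break
                if tok is MASK:
                    block[i] = propose(base + slot * self.block_size + i)
                    budget -= 1
                    decoded += 1
        # Commit finished leading blocks to the prefix and append fresh masked
        # blocks, so the number of blocks in the buffer never changes.
        while all(tok is not MASK for tok in self.blocks[0]):
            self.prefix.extend(self.blocks.pop(0))
            self.blocks.append([MASK] * self.block_size)
        return decoded


def decode(total_tokens, block_size=4, budgets=(2, 1, 1)):
    """Run the toy buffer until total_tokens tokens have been committed."""
    buf = BlockBuffer(block_size, budgets)
    passes, shapes = 0, set()
    while len(buf.prefix) < total_tokens:
        shapes.add(buf.input_shape())
        buf.step(lambda pos: pos)  # toy "model": the token is its own position
        passes += 1
    return buf.prefix[:total_tokens], passes, shapes
```

In this toy setting, `decode(16)` commits 16 tokens in 5 passes with a single observed input shape of `(3, 4)`, whereas a one-block buffer, `decode(16, budgets=(2,))`, needs 8 passes. These counts are artifacts of the assumed budgets and say nothing about the reported benchmark figures.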
This work makes multi-block diffusion decoding practical by aligning the training distribution with inference-time states, while raising the number of tokens decoded per forward pass.
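To put the reported figures in perspective, the short calculation below works through the arithmetic. It uses only the numbers quoted above; the helper names and the 1024-token generation length are illustrative assumptions. The ratio is a ratio of forward-pass counts, not a measured wall-clock speedup.

```python
import math

# Figures quoted in the summary above.
BASELINE_TPF = 3.47   # average tokens per forward pass before MBD
MBD_TPF = 6.19        # MBD-LLaDA2-Mini
MBD_DMAX_TPF = 9.34   # MBD-LLaDA2-Mini-DMax
BASELINE_ACC = 79.95  # average accuracy (%) before MBD
MBD_ACC = 81.03       # average accuracy (%) for MBD-LLaDA2-Mini


def tpf_ratio(new_tpf, old_tpf):
    """How many times more tokens each forward pass yields on average."""
    return new_tpf / old_tpf


def passes_needed(num_tokens, tpf):
    """Forward passes needed to emit num_tokens at a given average TPF."""
    return math.ceil(num_tokens / tpf)


ratio = tpf_ratio(MBD_TPF, BASELINE_TPF)  # about 1.78
acc_gain = MBD_ACC - BASELINE_ACC         # about 1.08 percentage points

# Hypothetical 1024-token generation, chosen only for illustration.
passes = {
    "baseline": passes_needed(1024, BASELINE_TPF),  # 296
    "mbd": passes_needed(1024, MBD_TPF),            # 166
    "mbd_dmax": passes_needed(1024, MBD_DMAX_TPF),  # 110
}
```

The summary does not state which baseline the DMax variant's 9.34 TPF should be compared against, so no ratio is computed for it here.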