SJTU-DENG-Lab proposes Multi-Block Diffusion Language Models with Multi-block Teacher Forcing

Researchers from SJTU-DENG-Lab introduce Multi-Block Diffusion Language Models (MBD-LMs), which extend single-block diffusion models by decoding a running set of consecutive blocks concurrently, adding inter-block parallelism. The approach applies Multi-block Teacher Forcing (MultiTF) during post-training to close the gap between the states seen in training and those encountered at inference, and pairs it with an optimized decoding algorithm built on a Block Buffer mechanism.

MBD-LMs use Multi-block Teacher Forcing to train on bounded noise groups conditioned on clean prefixes, with randomized noise schedulers.
The Block Buffer mechanism preserves prefix-cache reuse and keeps input shapes static to translate parallelism into wall-clock acceleration.
MBD-LLaDA2-Mini increases average Tokens Per Forward pass (TPF) from 3.47 to 6.19 (about 1.8x) and improves average accuracy from 79.95% to 81.03% (+1.08 percentage points).
When combined with DMax, MBD-LLaDA2-Mini-DMax reaches an average TPF of 9.34 with only a 1.02% accuracy drop on math and code benchmarks.
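
To make the idea of a running set of blocks more concrete, here is a toy Python simulation. It is only an illustrative sketch, not the authors' decoding algorithm: the function name `toy_multi_block_decode`, the fixed per-pass unmasking budgets, and the use of position indices as stand-in tokens are all assumptions made for this example. It shows a buffer that holds a fixed number of blocks, commits the leading block to the prefix once it is fully decoded, and admits a fresh masked block in its place.

```python
MASK = -1  # placeholder id for a still-masked position (assumption for this toy)


def toy_multi_block_decode(total_blocks, block_size=4, buffer_blocks=2):
    """Simulate decoding with a fixed-size running set of blocks.

    Returns (committed_tokens, forward_passes). Tokens are just position
    indices; a real model would predict them.
    """
    committed = []   # stands in for the clean, cacheable prefix
    buffer = []      # running set of partially decoded blocks
    next_block = 0   # index of the next block to admit

    def push_block():
        nonlocal next_block
        buffer.append({"start": next_block * block_size,
                       "tokens": [MASK] * block_size})
        next_block += 1

    while len(buffer) < buffer_blocks and next_block < total_blocks:
        push_block()

    passes = 0
    while buffer:
        passes += 1  # one simulated forward pass over the whole buffer
        for depth, block in enumerate(buffer):
            # Hypothetical rule: the leading block unmasks up to 2 tokens
            # per pass, later blocks unmask 1.
            budget = 2 if depth == 0 else 1
            for i, tok in enumerate(block["tokens"]):
                if budget == 0:
                    break
                if tok == MASK:
                    block["tokens"][i] = block["start"] + i
                    budget -= 1
        # Commit fully decoded leading blocks and refill the buffer.
        while buffer and MASK not in buffer[0]["tokens"]:
            committed.extend(buffer.pop(0)["tokens"])
            if next_block < total_blocks:
                push_block()
    return committed, passes


tokens, passes = toy_multi_block_decode(total_blocks=4, buffer_blocks=2)
print(f"multi-block TPF:  {len(tokens) / passes:.2f}")
tokens1, passes1 = toy_multi_block_decode(total_blocks=4, buffer_blocks=1)
print(f"single-block TPF: {len(tokens1) / passes1:.2f}")
```

In this toy setting the two-block buffer finishes in fewer passes than the single-block run because later blocks make progress while the leading block is still being decoded. The numbers it prints say nothing about the reported benchmarks. A real implementation would presumably also pad the buffer at the end of generation to keep tensor shapes static, which this sketch does not model.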

This work makes multi-block diffusion decoding practical by aligning the training distribution with the states seen at inference, raising average TPF from 3.47 to 6.19 on MBD-LLaDA2-Mini.