iLLaDA: An 8B Masked Diffusion Language Model with Fully Bidirectional Attention

The authors introduce iLLaDA, an 8B-parameter masked diffusion language model trained from scratch with fully bidirectional attention, in contrast to the autoregressive factorization and causal attention that dominate modern large language models. Pre-training was scaled to 12 trillion tokens and was followed by supervised fine-tuning for 12 epochs on a 25-billion-token instruction corpus. iLLaDA keeps the masked diffusion objective in both training phases and uses variable-length generation for efficiency. It also introduces confidence-based scoring to improve performance on multiple-choice evaluation tasks. Benchmark results show large gains over its predecessor, LLaDA: the base model improves by 21.6 points on BBH and 14.9 points on ARC-Challenge, and the instruction-tuned variant improves by 14.5 points on MATH and 16.5 points on HumanEval. Despite being non-autoregressive, iLLaDA remains competitive with Qwen2.5 7B across several metrics.
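To make the masked diffusion objective concrete, here is a minimal sketch of one loss evaluation. It assumes the formulation used by the predecessor LLaDA: draw a mask ratio t, mask each token independently with probability t, and average the negative log-likelihood of the masked tokens with a 1/t weight. The summary does not confirm that iLLaDA uses exactly this form. `MASK_ID`, `mask_tokens`, `masked_diffusion_loss`, and the toy `uniform_log_prob` model are hypothetical names chosen for illustration, not the authors' code.

```python
import math
import random

MASK_ID = -1  # hypothetical id reserved for the mask token


def mask_tokens(tokens, t, rng):
    """Forward process: replace each token with MASK_ID independently with probability t."""
    return [MASK_ID if rng.random() < t else tok for tok in tokens]


def masked_diffusion_loss(tokens, noisy, t, log_prob_fn):
    """Monte Carlo loss estimate for one sequence at mask ratio t (assumed LLaDA-style).

    log_prob_fn(noisy, i, target) is a stand-in for the model: it returns
    log p(target at position i | noisy), where the model may read the whole
    noisy sequence (bidirectional context).
    """
    total = 0.0
    for i, (clean, seen) in enumerate(zip(tokens, noisy)):
        if seen == MASK_ID:  # only masked positions contribute to the loss
            total -= log_prob_fn(noisy, i, clean)
    return total / (t * len(tokens))  # 1/t weight, normalized by sequence length


def uniform_log_prob(noisy, i, target, vocab_size=4):
    """Toy 'model' that assigns equal probability to every vocabulary item."""
    return -math.log(vocab_size)


rng = random.Random(0)
tokens = [0, 1, 2, 3, 0, 1, 2, 3]
t = rng.uniform(0.1, 1.0)  # mask ratio; kept away from 0 to avoid dividing by zero
noisy = mask_tokens(tokens, t, rng)
loss = masked_diffusion_loss(tokens, noisy, t, uniform_log_prob)
```

In a real training loop the toy model would be replaced by a Transformer without a causal mask, and the same loss would be applied in both pre-training and fine-tuning, in line with the statement above that the objective is kept throughout.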

Benchmark	Model	Gain over LLaDA
MATH	iLLaDA-Instruct	14.5pts
BIG-Bench Hard	iLLaDA-Base	21.6pts
