Understanding Evaluation Illusion in Diffusion Large Language Models

A study finds that evaluation results for diffusion large language models (dLLMs) are highly sensitive to the prompt template, which creates the illusion that parallel decoding improves efficiency without any loss in performance.

Current parallel decoding methods consistently underperform the single-token decoding baseline and fail to overcome the speed-quality trade-off.
The ranking of decoding methods is highly sensitive to minor variations in prompt templates, leading to inconsistent evaluation results.
A well-chosen prompt template can achieve strong results with fewer denoising steps, and its benefit exceeds the marginal gains obtained by increasing the number of steps.
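
One way to make the ranking-sensitivity finding concrete is to rank the decoding methods under each prompt template and measure how well the rankings agree. The sketch below does this with a hand-written Kendall rank correlation. It is an illustration only, not the study's evaluation protocol: the template names, method names, and accuracy values are hypothetical placeholders, and the helper functions `rank_methods` and `kendall_tau` are introduced here for the example.

```python
from itertools import combinations


def rank_methods(scores):
    """Return method names ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


def kendall_tau(order_a, order_b):
    """Kendall rank correlation between two orderings of the same items.

    Assumes no ties and at least two items. Returns 1.0 for identical
    orderings and -1.0 for fully reversed orderings.
    """
    pos_a = {m: i for i, m in enumerate(order_a)}
    pos_b = {m: i for i, m in enumerate(order_b)}
    concordant = discordant = 0
    for x, y in combinations(order_a, 2):
        # A pair is concordant when both orderings place x and y
        # in the same relative order.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical placeholder accuracies, NOT results from the study.
results = {
    "template_a": {"single_token": 0.70, "parallel_x": 0.68, "parallel_y": 0.65},
    "template_b": {"single_token": 0.72, "parallel_x": 0.60, "parallel_y": 0.66},
}

rank_a = rank_methods(results["template_a"])
rank_b = rank_methods(results["template_b"])
tau = kendall_tau(rank_a, rank_b)
```

With these placeholder numbers the two parallel methods swap places between templates while the single-token baseline stays first, so `tau` falls below 1.0. A value well below 1.0 across realistic template variations would indicate that a method comparison made under a single template is not reliable.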

These findings highlight the need for reliable evaluation guidelines to prevent biased conclusions about dLLM decoding methods.