BlockPilot introduces a sample-adaptive policy for diffusion-based speculative decoding that predicts the inference block size for each input from its prefilling representations. This addresses the suboptimality of a single fixed block size by leveraging the local structure of optimal values around the training block size, that is, the tendency of per-sample optimal block sizes to lie close to the block size used in training.
- Formulates block size selection as a lightweight policy learning problem with an instance-adaptive decision mechanism.
- Performs prediction only once, after the prefilling stage, so it integrates into an existing decoding pipeline with minimal overhead.
- Achieves an acceptance length of 5.92 and a 4.20× speedup on Qwen3-4B under temperature T=1.
The method is described as plug-and-play, consistently improving efficiency without requiring significant computational resources or architectural changes.
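The decision mechanism summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the candidate window (`radius`, `step`), the mean pooling of prefill hidden states, the linear head, the training block size of 16, and all names (`candidate_block_sizes`, `BlockSizePolicy`) are assumptions made for illustration. The weights are random, so the sketch shows only the control flow: build a small candidate set around the training block size, run the policy once on the prefill representation, and reuse the chosen block size for the rest of decoding.

```python
import numpy as np


def candidate_block_sizes(train_block_size, radius=2, step=2):
    """Hypothetical candidate set: a small window around the training block size."""
    sizes = [train_block_size + k * step for k in range(-radius, radius + 1)]
    return [s for s in sizes if s >= 1]  # drop non-positive sizes


class BlockSizePolicy:
    """Toy linear policy head (an assumption, not the paper's architecture)."""

    def __init__(self, hidden_dim, candidates, seed=0):
        rng = np.random.default_rng(seed)
        self.candidates = list(candidates)
        # Untrained placeholder weights; a real policy would be learned.
        self.W = rng.normal(scale=0.02, size=(hidden_dim, len(self.candidates)))
        self.b = np.zeros(len(self.candidates))

    def predict(self, prefill_hidden):
        """Pick one block size from prefill hidden states of shape (seq_len, hidden_dim)."""
        pooled = prefill_hidden.mean(axis=0)  # mean-pool over prompt tokens
        logits = pooled @ self.W + self.b     # one score per candidate
        return self.candidates[int(np.argmax(logits))]


# One prediction per sample, made right after prefilling.
candidates = candidate_block_sizes(train_block_size=16)  # 16 is an assumed value
policy = BlockSizePolicy(hidden_dim=64, candidates=candidates)
prefill_hidden = np.random.default_rng(1).normal(size=(10, 64))  # stand-in for real states
block_size = policy.predict(prefill_hidden)
print("chosen block size:", block_size)
```

Because the policy runs once per sample on representations that prefilling already computes, its cost in this sketch is a single pooled vector and one small matrix product, which is in keeping with the minimal-overhead claim.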