HyperDFlash is a block-parallel speculative decoding framework that addresses the feature misalignment that arises when DFlash is adapted to DeepSeek-V4's multi-hyper-connection (MHC) architecture. The authors propose two key optimizations: conditioning the drafter on pre-collapse residual states, and replacing the generic linear compressor with a lightweight gated residual reducer inherited from the model's hyper-connection head.
- Utilizes pre-collapse residual states as the exclusive conditioning signal to preserve multi-path structural information and align with the target model's native prediction pathway.
- Replaces the heavy generic linear compressor with a lightweight gated residual reducer that has three orders of magnitude fewer parameters while remaining aligned with the target architecture.
- Employs a targeted KL distillation loss on the LM head to regularize draft predictions against the full target probability distribution, improving draft quality during early training.
- Consistently outperforms native multi-token prediction (MTP) baselines and vanilla DFlash adaptations on math reasoning, code synthesis, and conversational benchmarks.
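
The KL distillation term in the list above can be illustrated with a minimal sketch. It assumes a forward KL, KL(target || draft), averaged over the positions of one draft block, and an optional softmax temperature. The direction of the divergence, the temperature, and the function names are assumptions made for illustration and are not taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over one vocabulary-sized list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * (math.log(pi + eps) - math.log(qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def distillation_loss(target_logits, draft_logits, temperature=1.0):
    """Mean forward KL(target || draft) over the positions of one block.

    Hypothetical helper: both arguments are lists of per-position logit
    lists, as produced by the target and draft LM heads respectively.
    """
    per_position = [
        kl_divergence(softmax(t, temperature), softmax(d, temperature))
        for t, d in zip(target_logits, draft_logits)
    ]
    return sum(per_position) / len(per_position)


# Toy example: two draft positions over a three-token vocabulary.
target_logits = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]
matched = distillation_loss(target_logits, target_logits)
mismatched = distillation_loss(
    target_logits, [[-1.0, 0.5, 2.0], [1.5, 0.1, 0.3]]
)
```

Unlike a cross-entropy loss on the single sampled token, this term penalizes the drafter for any mismatch with the target's full distribution, which is the property the summary attributes to the distillation loss.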
The framework achieves substantial gains in average accepted draft length and decoding speedup, supporting the effectiveness of its MHC alignment and gated reduction strategies for speculative decoding.
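
For readers unfamiliar with the metric, the following sketch shows one common way to compute average accepted draft length. It assumes greedy verification, where a drafted token is accepted only if it equals the target model's own choice at that position and acceptance stops at the first mismatch. The function names and toy token IDs are hypothetical, and the paper's exact convention (for example, whether the target's bonus token is counted) may differ.

```python
def accepted_length(draft_tokens, target_tokens):
    """Length of the longest prefix on which draft and target agree."""
    count = 0
    for drafted, verified in zip(draft_tokens, target_tokens):
        if drafted != verified:
            break  # every token after the first mismatch is discarded
        count += 1
    return count


def mean_accepted_length(steps):
    """Average accepted length over a list of (draft, target) token pairs."""
    lengths = [accepted_length(draft, target) for draft, target in steps]
    return sum(lengths) / len(lengths)


# Toy decoding trace with a draft block size of 4 (made-up token IDs).
steps = [
    ([11, 42, 7, 9], [11, 42, 7, 9]),  # all four drafted tokens accepted
    ([11, 42, 7, 9], [11, 42, 8, 9]),  # mismatch at the third position
    ([5, 6, 7, 8], [4, 6, 7, 8]),      # mismatch at the first position
]
average = mean_accepted_length(steps)
print(average)  # 2.0
```

A longer average accepted length means fewer target-model forward passes per generated token, which is why the metric is reported alongside decoding speedup.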