The authors propose Entropy-guided Multi-Token Prediction (EntMTP), a training-free scheduler that dynamically adjusts speculation depth during LLM inference based on local generation entropy. This approach addresses the inefficiency of static tree-based attention topologies by matching compute requirements to context predictability.
- EntMTP toggles between task-specific Pareto-optimal trees conditioned on running estimates of local generation entropy.
- The method maximizes expected accepted-token throughput across the full distribution of generated text without sacrificing quality.
- Benchmarks include HumanEval, ShareGPT, GSM8K, and LitBench.
- EntMTP achieves a consistent 1.15x speedup over Hydra baselines.
- Peak speedup reaches 1.36x compared to Medusa baselines.
By aligning speculation depth with the entropy patterns of natural language, EntMTP optimizes inference efficiency in both low-entropy and high-entropy regions.
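
The scheduling idea can be illustrated with a minimal sketch. It is not the authors' implementation: the exponential moving average, the single threshold, the two tree labels (`"deep"` and `"shallow"`), and the names `token_entropy` and `EntropyScheduler` are all assumptions made for illustration. The sketch also assumes that low-entropy (predictable) contexts favor deeper speculation, which is one plausible reading of the summary above.

```python
import math
from dataclasses import dataclass


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


@dataclass
class EntropyScheduler:
    """Hypothetical scheduler: tracks a running entropy estimate and
    picks a speculation tree from it. All defaults are illustrative."""
    threshold: float = 1.0   # assumed switch point, in nats
    decay: float = 0.8       # assumed EMA smoothing factor
    running: float = 0.0
    initialized: bool = False

    def update(self, probs):
        """Fold the entropy of the latest step into the running estimate."""
        h = token_entropy(probs)
        if not self.initialized:
            self.running = h
            self.initialized = True
        else:
            self.running = self.decay * self.running + (1.0 - self.decay) * h
        return self.running

    def pick_tree(self):
        """Assumption: predictable context -> deeper speculation tree."""
        return "deep" if self.running < self.threshold else "shallow"


# Usage: feed one next-token distribution per decoding step.
sched = EntropyScheduler(threshold=1.0, decay=0.5)
for step_probs in ([1.0, 0.0, 0.0, 0.0], [0.25] * 4, [0.25] * 4):
    sched.update(step_probs)
    print(sched.pick_tree(), round(sched.running, 3))
```

In a real system the two labels would map to precomputed tree topologies for the task at hand, and the threshold and smoothing factor would need tuning; none of these values come from the source.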