EntMTP: Accelerating LLM Inference with Entropy-Guided Multi-Token Prediction

The authors propose Entropy-guided Multi-Token Prediction (EntMTP), a training-free scheduler that dynamically adjusts speculation depth during LLM inference based on local generation entropy. This approach addresses the inefficiency of static tree-based attention topologies by matching compute requirements to context predictability.

EntMTP switches between task-specific, Pareto-optimal speculation trees according to a running estimate of local generation entropy.
The method maximizes expected accepted-token throughput across the full distribution of generated text without sacrificing quality.
Benchmarks include HumanEval, ShareGPT, GSM8K, and Litbench.
EntMTP achieves a consistent 1.15x speedup over Hydra baselines.
Peak speedup reaches 1.36x over Medusa baselines.
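
The entropy-conditioned switching described above can be illustrated with a minimal sketch. It is not the authors' implementation: the exponential moving average, the threshold and decay values, and the tree labels "deep" and "shallow" are all assumptions chosen for illustration. The sketch only shows the general idea of tracking a running entropy estimate and speculating more aggressively when the context is predictable.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


@dataclass
class EntropyScheduler:
    # All defaults are hypothetical; the source does not specify them.
    threshold: float = 1.0            # switch point, in nats
    decay: float = 0.8                # weight kept on the previous estimate
    running: Optional[float] = None   # running entropy estimate

    def update(self, probs: Sequence[float]) -> float:
        """Fold the entropy of the latest step into the running estimate."""
        h = token_entropy(probs)
        if self.running is None:
            self.running = h
        else:
            self.running = self.decay * self.running + (1.0 - self.decay) * h
        return self.running

    def choose_tree(self) -> str:
        """Pick a speculation tree label from the running estimate.

        Low entropy (predictable text) -> deeper speculation.
        High entropy, or no estimate yet -> shallower speculation.
        """
        if self.running is None or self.running >= self.threshold:
            return "shallow"
        return "deep"


# Usage: feed the model's next-token probabilities after each decoding step.
sched = EntropyScheduler(threshold=1.0, decay=0.5)
sched.update([1.0, 0.0, 0.0, 0.0])      # fully predictable step
tree_after_easy_step = sched.choose_tree()
sched.update([0.25, 0.25, 0.25, 0.25])  # uniform over four tokens
sched.update([0.25, 0.25, 0.25, 0.25])
tree_after_hard_steps = sched.choose_tree()
```

In a real decoder the labels would map to precomputed tree-attention topologies, and the threshold would presumably be tuned per task. Both details are outside what this summary states.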

By aligning speculation depth with the entropy patterns of natural language, EntMTP optimizes inference efficiency in both low-entropy and high-entropy regions.