The authors propose Entropy-guided Multi-Token Prediction (EntMTP), a training-free scheduler that dynamically adjusts speculation depth during LLM inference based on local generation entropy. This approach addresses the inefficiency of static tree-based attention topologies by matching compute requirements to context predictability.

  • EntMTP toggles between task-specific Pareto-optimal trees conditioned on running estimates of local generation entropy.
  • The method maximizes expected accepted-token throughput across the full distribution of generated text without sacrificing quality.
  • Benchmarks include HumanEval, ShareGPT, GSM8K, and Litbench.
  • EntMTP achieves a consistent 1.15x speedup over Hydra baselines.
  • Peak speedup reaches 1.36x compared to Medusa baselines.
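
The entropy-conditioned toggling described in the first bullet can be illustrated with a minimal sketch. This is not the authors' implementation: the exponential moving average, the threshold and decay values, the "deep" and "shallow" tree labels, and the rule that low entropy selects the deeper tree are all assumptions made for illustration, since the summary only states that the scheduler switches between trees based on a running entropy estimate.

```python
import math
from dataclasses import dataclass


def shannon_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


@dataclass
class EntropyScheduler:
    """Hypothetical scheduler: picks a speculation tree from a running entropy estimate."""
    threshold: float = 1.0   # assumed switch point in nats, not from the source
    decay: float = 0.8       # assumed smoothing factor for the running estimate
    running: float = 0.0
    initialized: bool = False

    def update(self, probs):
        """Fold the entropy of the latest next-token distribution into the running estimate."""
        h = shannon_entropy(probs)
        if not self.initialized:
            self.running = h
            self.initialized = True
        else:
            self.running = self.decay * self.running + (1.0 - self.decay) * h
        return self.running

    def choose_tree(self):
        """Assumed policy: predictable context -> deeper tree, uncertain context -> shallower tree."""
        return "deep" if self.running < self.threshold else "shallow"


# Usage: a sharply peaked distribution, followed by several near-uniform ones.
scheduler = EntropyScheduler()
scheduler.update([0.97, 0.01, 0.01, 0.01])
tree_in_predictable_region = scheduler.choose_tree()

for _ in range(5):
    scheduler.update([1.0 / 50] * 50)
tree_in_uncertain_region = scheduler.choose_tree()
```

In this sketch the smoothing keeps the scheduler from switching trees on a single surprising token; how EntMTP actually estimates local entropy and which trees it selects per task is not specified in this summary.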

By aligning speculation depth with the entropy patterns of natural language, EntMTP optimizes inference efficiency in both low-entropy and high-entropy regions.