JetSpec introduces a speculative decoding method called causal parallel tree drafting that co-optimizes drafting cost and quality to reduce LLM generation latency. The approach achieves up to 9.64x end-to-end speedup on MATH-500 and 4.58x on open-ended chat while maintaining lossless accuracy.
JetSpec drafts a causality-preserving tree in a single pass. This addresses a dilemma in prior speculative decoding methods: autoregressive draft heads incur high drafting cost, while block-diffusion heads produce inconsistent branches.
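
As a minimal sketch of what "causality-preserving" can mean for a draft tree, the example below assumes the common tree-attention convention in which each draft node attends only to itself and its ancestors, so sibling branches never see each other. The function name `ancestor_mask` and the parent-pointer layout are illustrative assumptions, not JetSpec's implementation.

```python
from typing import List


def ancestor_mask(parents: List[int]) -> List[List[bool]]:
    """Build a tree attention mask from parent pointers.

    parents[i] is the index of node i's parent, or -1 if node i hangs
    directly off the already-accepted prefix. Nodes are assumed to be
    listed parents-first (parents[i] < i). This layout is hypothetical.

    mask[i][j] is True when node i may attend to node j, i.e. when j is
    node i itself or one of its ancestors.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:  # walk from node i up to the prefix
            mask[i][j] = True
            j = parents[j]
    return mask


# A small draft tree with two branches:
#   0 -> 1 -> 3
#   0 -> 2 -> 4
parents = [-1, 0, 0, 1, 2]
mask = ancestor_mask(parents)
print(mask[3])  # [True, True, False, True, False]
print(mask[4])  # [True, False, True, False, True]
```

Node 3 sees its own branch (0, 1, 3) but not the sibling branch (2, 4), which is the property that keeps every root-to-leaf path a valid left-to-right continuation when the whole tree is scored in one forward pass.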
With CUDA graph and kernel optimizations, JetSpec reaches approximately 1000 tokens per second (TPS) on a single B200 GPU.
Because these speedups come without sacrificing output quality, JetSpec offers a practical option for high-throughput large language model deployment.
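
To illustrate why tree-based speculative decoding can be lossless, here is a hedged sketch of greedy tree verification: a drafted token is kept only if it equals the token the target model itself would have produced at that position, so the final output matches plain greedy decoding. The function `verify_tree` and its arguments are hypothetical, the target model's predictions are passed in as plain lists, and JetSpec's actual verification rule may differ.

```python
from typing import List


def verify_tree(
    parents: List[int],
    draft_tokens: List[int],
    target_next: List[int],
    root_target: int,
) -> List[int]:
    """Return the tokens emitted by one verification step.

    parents[i]      parent of draft node i, or -1 (child of the prefix);
                    nodes are assumed listed parents-first.
    draft_tokens[i] token proposed at node i.
    target_next[i]  target model's greedy token after the path ending at i.
    root_target     target model's greedy token right after the prefix.
    """
    n = len(parents)
    accepted = [False] * n
    depth = [0] * n
    best = -1  # deepest accepted node so far
    for i in range(n):
        p = parents[i]
        expected = root_target if p == -1 else target_next[p]
        parent_ok = p == -1 or accepted[p]
        if parent_ok and draft_tokens[i] == expected:
            accepted[i] = True
            depth[i] = 1 if p == -1 else depth[p] + 1
            if best == -1 or depth[i] > depth[best]:
                best = i
    # Recover the accepted path by walking back to the prefix.
    path = []
    j = best
    while j != -1:
        path.append(draft_tokens[j])
        j = parents[j]
    path.reverse()
    # The target's own prediction after the accepted path is always emitted.
    bonus = root_target if best == -1 else target_next[best]
    return path + [bonus]


parents = [-1, -1, 0, 0, 1]
draft_tokens = [5, 7, 9, 3, 2]
target_next = [3, 8, 1, 4, 6]
print(verify_tree(parents, draft_tokens, target_next, root_target=5))  # [5, 3, 4]
```

Every emitted token is one the target model would have chosen anyway, so a step yields at least one token and more when the draft tree contains the correct path. In this sketch, the speedup depends on how often that happens and on how cheap the draft is.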