JetSpec: Speculative Decoding with Parallel Tree Drafting Enables up to 9.64x Lossless LLM Inference Speedup
JetSpec introduces causal parallel tree drafting, a speculative decoding method that co-optimizes drafting cost and draft quality to reduce LLM generation latency. The approach achieves up to 9.64x end-to-end speedup on MATH-500 and 4.58x on open-ended chat with no loss in accuracy.
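For readers new to speculative decoding, the sketch below illustrates why a draft-then-verify loop can be lossless. It is a generic greedy, single-sequence version and not JetSpec's causal parallel tree drafting. The names `target_model`, `draft_model`, and `speculative_decode` are hypothetical toy stand-ins introduced only for illustration.

```python
from typing import Callable, List, Tuple

# A "model" here is any function mapping a token context to its greedy next token.
Model = Callable[[List[int]], int]


def target_model(ctx: List[int]) -> int:
    # Toy stand-in for the large model (assumption): deterministic in the context.
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_model(ctx: List[int]) -> int:
    # Toy stand-in for the cheap drafter (assumption): agrees with the target
    # except when the context length is a multiple of 4.
    guess = target_model(ctx)
    return (guess + 1) % 50 if len(ctx) % 4 == 0 else guess


def greedy_decode(model: Model, prompt: List[int], n_new: int) -> List[int]:
    # Baseline: one model call per generated token.
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def speculative_decode(
    target: Model, draft: Model, prompt: List[int], n_new: int, k: int = 4
) -> Tuple[List[int], int]:
    out = list(prompt)
    goal = len(prompt) + n_new
    passes = 0  # number of verification rounds
    while len(out) < goal:
        # 1) The drafter proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2) The target checks the proposal. A real system would score all k
        #    positions in one batched forward pass; here we call it per position.
        passes += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok != expected:
                accepted.append(expected)  # replace the first wrong draft token
                break
            accepted.append(tok)
        else:
            # Every draft token matched, so the target contributes one more token.
            accepted.append(target(out + accepted))
        out.extend(accepted)
    return out[:goal], passes


prompt = [1, 2, 3]
baseline = greedy_decode(target_model, prompt, 16)
fast, passes = speculative_decode(target_model, draft_model, prompt, 16)
```

Every token kept by `speculative_decode` is one the target itself would have chosen, so `fast` matches `baseline` exactly, while the number of verification rounds is smaller than the number of generated tokens. Any real speedup depends on how cheap the drafter is and how often its proposals are accepted.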