Quantization Impact on MTP Draft Acceptance Rates

An analysis of speculative decoding with multi-token prediction (MTP) drafts on Gemma 4-31B-it shows that aggressive quantization lowers the token acceptance rate, because the quantized main model agrees less often with the drafter. Tests across the Q5_K_S, IQ4_XS, IQ3_M, and IQ2_M quantizations show how acceptance changes with draft depth n.

Acceptance rates decline as draft depth increases for all quantization levels tested.
Q5_K_S provides the highest fidelity, while IQ4_XS and IQ3_M perform nearly identically.
Even the 2-bit IQ2_M maintains high acceptance rates for single-token drafts (84.5% at n=1).
Hardware architecture significantly influences speedup gains, with CUDA devices benefiting most from draft depth n=2.
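
To make the link between acceptance rate and draft depth concrete, the sketch below estimates how many tokens one verification pass yields. It uses the standard geometric-series estimate from the speculative decoding literature and assumes every drafted token is accepted independently with the same probability alpha. That assumption is a simplification: the results above show acceptance falling with depth, so for n > 1 the estimate is optimistic. The function name is illustrative and is not taken from the study.

```python
def expected_tokens_per_step(alpha: float, n: int) -> float:
    """Expected tokens emitted per verification pass at draft depth n.

    Assumption (not from the study): each drafted token is accepted
    independently with probability alpha. The main model always
    contributes one token, so the result lies between 1 and n + 1.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if n < 0:
        raise ValueError("n must be non-negative")
    if alpha == 1.0:
        # Every draft is accepted: n drafted tokens plus one from the main model.
        return float(n + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^n
    return (1.0 - alpha ** (n + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    # 84.5% is the IQ2_M acceptance rate at n=1 reported above.
    print(round(expected_tokens_per_step(0.845, 1), 3))
```

At n=1 the estimate reduces to 1 + alpha, so the reported 84.5% corresponds to roughly 1.85 tokens per main-model pass, before accounting for the cost of running the drafter.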

The study indicates that low bit-width quantizations can still support speculative decoding effectively, allowing users to run the 31B trunk model with as little as 12 GB of memory using IQ2_M.
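
As a rough plausibility check on the 12 GB figure, the sketch below estimates weight storage from parameter count and average bits per weight. The 2.7 bits-per-weight value is the nominal figure usually quoted for IQ2_M and is an assumption here, not a number from the study; real model files mix tensor types, and the KV cache, the drafter, and runtime buffers need additional memory.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    if n_params < 0 or bits_per_weight < 0:
        raise ValueError("inputs must be non-negative")
    # bits -> bytes -> gigabytes
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    # Assumptions: 31e9 parameters, ~2.7 bits per weight for IQ2_M (nominal).
    print(f"{weight_memory_gb(31e9, 2.7):.1f} GB for weights alone")
```

Under these assumptions the weights take about 10.5 GB, which leaves a small margin inside a 12 GB budget for the cache and the draft head.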