An analysis of speculative decoding using Gemma 4-31B-it models shows that heavy quantization lowers the token acceptance rate, because the quantized main model's outputs diverge further from what the drafter predicts. Tests across the Q5_K_S, IQ4_XS, IQ3_M, and IQ2_M quantizations show how the acceptance rate changes with draft depth n, the number of tokens drafted per verification step.

  • Acceptance rates decline as draft depth increases for all quantization levels tested.
  • Q5_K_S provides the highest fidelity, while IQ4_XS and IQ3_M perform nearly identically.
  • Even the 2-bit IQ2_M maintains high acceptance rates for single-token drafts (84.5% at n=1).
  • Hardware architecture significantly influences speedup gains, with CUDA devices benefiting most from draft depth n=2.
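
To make the link between acceptance rate and draft depth concrete, here is a minimal Python sketch of the expected number of tokens emitted per verification pass. It assumes the usual speculative-decoding accounting: draft tokens are accepted in order until the first rejection, and the main model always contributes one token of its own. The function name and the deeper-draft rates are hypothetical illustrations; only the 84.5% figure comes from the results above.

```python
from itertools import accumulate
from operator import mul


def expected_tokens_per_step(accept_probs):
    """Expected tokens emitted per main-model verification pass.

    accept_probs[i] is the assumed probability that draft token i+1 is
    accepted, given that every earlier draft token was accepted.
    """
    # Probability that the first k draft tokens all survive, for k = 1..n.
    survive = list(accumulate(accept_probs, mul))
    # The main model always adds one token (a correction or a bonus token).
    return 1.0 + sum(survive)


# 84.5% is the IQ2_M single-token acceptance rate quoted above.
print(round(expected_tokens_per_step([0.845]), 3))  # 1.845

# Hypothetical declining rates for a depth-3 draft (NOT measured values).
print(round(expected_tokens_per_step([0.845, 0.80, 0.75]), 3))
```

Because each extra draft position only counts if all earlier ones were accepted, declining acceptance rates give diminishing returns at larger n. Whether a deeper draft pays off in wall-clock time also depends on drafting and verification cost, which this sketch does not model.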

The study indicates that low-bit quantizations can still support speculative decoding effectively, so users can run the 31B trunk model in as little as 12 GB of memory with IQ2_M quantization.
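
As a rough plausibility check on the 12 GB figure, the following Python sketch estimates weight storage from parameter count and bits per weight. The helper name is hypothetical, and the value of about 2.7 bits per weight for IQ2_M is an assumed nominal rate; real quantized files mix tensor types, so the effective rate can differ.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumptions: 31e9 parameters (taken from the model name) and a nominal
# ~2.7 bits per weight for IQ2_M; neither is a measured file size.
weights_gb = weight_memory_gb(31e9, 2.7)
print(f"weights: {weights_gb:.1f} GB")  # weights: 10.5 GB
```

Under these assumptions the weights alone take roughly 10.5 GB, which would leave about 1.5 GB of a 12 GB budget for the KV cache and runtime buffers. The source does not break the 12 GB figure down, so this split is only an estimate.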