An analysis of speculative decoding with Gemma 4-31B-it shows that heavier quantization of the main model lowers the token acceptance rate, because the quantized model's outputs diverge further from the drafter's predictions. Tests across Q5_K_S, IQ4_XS, IQ3_M, and IQ2_M quantizations at varying draft depths yield the following findings:
- Acceptance rates decline as draft depth increases for all quantization levels tested.
- Q5_K_S provides the highest fidelity, while IQ4_XS and IQ3_M perform nearly identically.
- Even the 2-bit IQ2_M maintains high acceptance rates for single-token drafts (84.5% at n=1).
- Hardware architecture significantly influences speedup gains, with CUDA devices benefiting most from draft depth n=2.
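To make the link between acceptance rate and draft depth concrete, here is a minimal sketch of the standard expected-throughput estimate for speculative decoding. It assumes every drafted token is accepted independently with the same probability `alpha`, which is a simplification: the results above show acceptance falling as depth grows, so the estimate is optimistic for n > 1. The function name and parameters are illustrative and do not come from the study.

```python
def expected_tokens_per_step(alpha: float, depth: int) -> float:
    """Expected tokens emitted per verification pass of the main model.

    Assumes each drafted token is accepted independently with
    probability `alpha` (a simplifying assumption, not a measured fact).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if depth < 0:
        raise ValueError("depth must be non-negative")
    # The main model always contributes one token per pass (either the
    # correction after a rejection or a bonus token after full acceptance).
    # Draft token k adds a token only if all k drafts up to it were accepted,
    # which happens with probability alpha**k.
    return sum(alpha ** k for k in range(depth + 1))


if __name__ == "__main__":
    # 0.845 is the IQ2_M acceptance rate reported for n=1.
    print(expected_tokens_per_step(0.845, 1))
```

With the reported IQ2_M figure of 84.5% at n=1, this gives roughly 1.845 tokens per main-model pass. Actual speedup is lower than that ratio because the drafter's own cost and hardware-specific overheads are not modeled here.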
The study indicates that low-bit quantizations can still support speculative decoding effectively, letting users run the 31B trunk model in as little as 12 GB of memory with IQ2_M quantization.
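As a rough sanity check on the memory figure, the sketch below estimates weight storage from parameter count and average bits per weight. Both inputs are assumptions: 31e9 parameters is taken from the model name, and 2.7 bits per weight is the approximate nominal figure for IQ2_M rather than a value reported by the study. The estimate covers weights only and leaves out the KV cache, the drafter, and runtime buffers.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Weights-only memory estimate in decimal gigabytes (1 GB = 1e9 bytes)."""
    if n_params < 0 or bits_per_weight < 0:
        raise ValueError("inputs must be non-negative")
    # bits -> bytes -> gigabytes
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    # Assumed values: 31e9 parameters, ~2.7 bits per weight for IQ2_M.
    print(f"{weight_memory_gb(31e9, 2.7):.1f} GB")
```

Under these assumptions the weights come to roughly 10.5 GB, which leaves a small margin within 12 GB for context and buffers.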