A user demonstrates running the Qwen3.6-27 model in Q8_0 quantization with up to 115,000 tokens of context on a system with 32GB of VRAM. By experimenting with different key-value (KV) cache quantization levels alongside the model weights, they achieved stable inference using llama-server and draft-mtp speculative decoding.

  • Option 1 used Q8_0 KV cache to support 95K context, achieving an aggregate token speed of 141.6 tok/s on code generation tasks.
  • Option 2 reduced the KV cache to Q5_1 to extend context to 105K tokens, maintaining similar performance with a 142.0 tok/s rate.
  • Option 3 further lowered the KV cache to Q4_0 to reach 115K context, resulting in an aggregate accept rate of 0.6969 and 138.7 tok/s for code generation.

The configuration allows users to push context limits significantly beyond typical constraints on consumer-grade hardware by balancing model weight precision with KV cache quantization.