A user testing Bartowski's DeepSeek-V4-Flash-MXFP4 GGUF in llama.cpp build 9851 found that changing the KV cache type from f16 to q8_0 shrinks the CUDA0 compute buffer by a factor of about 3.26.
- Switching from f16 to q8_0 reduced the total KV cache size from ~425 MiB to ~226 MiB.
- The same change lowered the compute buffer from 12,964 MiB to 3,973 MiB.
- This reduction prevents out-of-memory errors on 32 GB cards at high context lengths such as 32,000 tokens.
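
The figures above can be checked with a few lines of arithmetic. This is only a sketch using the reported numbers. The one outside fact it relies on is the ggml q8_0 block layout (32 int8 values plus one fp16 scale, 34 bytes per block), which predicts a KV cache ratio close to the one observed. It does not attempt to explain why the compute buffer shrinks by a larger factor.

```python
# Figures reported by the user, in MiB.
kv_f16, kv_q8 = 425, 226
buf_f16, buf_q8 = 12_964, 3_973

kv_ratio = kv_f16 / kv_q8      # KV cache shrink factor
buf_ratio = buf_f16 / buf_q8   # CUDA0 compute buffer shrink factor

# Combined saving across KV cache and compute buffer.
saved_mib = (kv_f16 + buf_f16) - (kv_q8 + buf_q8)

# f16 needs 2 bytes per value; a q8_0 block stores 32 values in 34 bytes.
expected_kv_ratio = (32 * 2) / (32 + 2)

print(f"KV cache ratio:       {kv_ratio:.2f}x (block layout predicts {expected_kv_ratio:.2f}x)")
print(f"Compute buffer ratio: {buf_ratio:.2f}x")
print(f"Total saved:          {saved_mib} MiB ({saved_mib / 1024:.2f} GiB)")
```

Almost all of the saving comes from the compute buffer, not from the KV cache itself.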
Forcing q8_0 cache quantization allows the model to load successfully in scenarios where f16 would exceed available VRAM.
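
As a rough illustration, the cache type could be forced by assembling a llama.cpp command line like the one below. The report does not say which binary or model path was used, so `llama-server` and the filename are assumptions, and `build_llama_args` is a hypothetical helper. The `--cache-type-k` and `--cache-type-v` options are llama.cpp's KV cache type flags. Some builds may also require flash attention to be enabled for a quantized V cache, so check `--help` for the build in use.

```python
import shlex


def build_llama_args(model_path: str, ctx_size: int = 32000,
                     cache_type: str = "q8_0") -> list[str]:
    """Assemble an argv list that requests a quantized KV cache."""
    return [
        "llama-server",                # assumption: binary not named in the report
        "--model", model_path,
        "--ctx-size", str(ctx_size),   # context length mentioned in the report
        "--cache-type-k", cache_type,  # K cache type
        "--cache-type-v", cache_type,  # V cache type
    ]


# Hypothetical filename; substitute the actual GGUF path.
args = build_llama_args("DeepSeek-V4-Flash-MXFP4.gguf")
print(shlex.join(args))
```

The list can be passed to `subprocess.run(args)` once the binary and model path match the local setup.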