llama.cpp KV cache quantization cuts DeepSeek-V4-Flash compute buffer by 3x

A user testing Bartowski's DeepSeek-V4-Flash-MXFP4 GGUF in llama.cpp build 9851 found that changing the KV cache type from f16 to q8_0 shrinks the CUDA0 compute buffer by a factor of about 3.26.

Switching from f16 to q8_0 reduced the total KV cache from ~425 MiB to ~226 MiB, a factor of roughly 1.9.
The same change lowered the CUDA0 compute buffer from 12,964 MiB to 3,973 MiB, a saving of 8,991 MiB.
This reduction avoids out-of-memory errors on 32 GB cards at long context lengths such as 32,000 tokens.
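
The arithmetic behind these figures can be checked with a short Python sketch. The four MiB values are the ones reported above; the function and variable names are illustrative only, and the total assumes the two savings simply add up.

```python
MIB_PER_GIB = 1024


def reduction_factor(before_mib: float, after_mib: float) -> float:
    """Return how many times smaller the 'after' size is than the 'before' size."""
    return before_mib / after_mib


# Reported sizes in MiB (f16 vs. q8_0 KV cache type).
compute_f16, compute_q8 = 12964, 3973   # CUDA0 compute buffer
kv_f16, kv_q8 = 425, 226                # total KV cache (approximate)

compute_ratio = reduction_factor(compute_f16, compute_q8)
kv_ratio = reduction_factor(kv_f16, kv_q8)

# Combined saving across both buffers, assuming they are additive.
saved_mib = (compute_f16 - compute_q8) + (kv_f16 - kv_q8)

print(f"compute buffer: {compute_ratio:.2f}x smaller")
print(f"KV cache:       {kv_ratio:.2f}x smaller")
print(f"total saved:    {saved_mib} MiB ({saved_mib / MIB_PER_GIB:.2f} GiB)")
```

Most of the saving comes from the compute buffer rather than from the KV cache itself, which is why the headline figure refers to the compute buffer.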

Forcing q8_0 cache quantization allows the model to load successfully in scenarios where f16 would exceed available VRAM.
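
As a rough illustration of how such a run might be launched, the sketch below assembles a llama.cpp command line in Python. The report does not show the exact command, so everything here is an assumption: the `llama-server` binary, the model file name (a hypothetical placeholder), and the choice to set both the K and V cache types through llama.cpp's `--cache-type-k` and `--cache-type-v` options. Some llama.cpp builds also require flash attention for a quantized V cache, so check `--help` on your build.

```python
import shlex


def build_llama_args(model_path: str,
                     ctx_size: int = 32000,
                     cache_type: str = "q8_0",
                     binary: str = "llama-server") -> list[str]:
    """Assemble an argument list that sets the same cache type for K and V."""
    return [
        binary,
        "-m", model_path,                # path to the GGUF file
        "-c", str(ctx_size),             # context length in tokens
        "--cache-type-k", cache_type,    # K cache type (assumed: same as V)
        "--cache-type-v", cache_type,    # V cache type
    ]


# Hypothetical file name, used for illustration only.
args = build_llama_args("DeepSeek-V4-Flash-MXFP4.gguf")
print(shlex.join(args))
```

Passing `cache_type="f16"` gives the baseline configuration for comparison; the resulting list can be handed to `subprocess.run` as-is.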