User fairydreaming has merged pull requests into their llama.cpp branch to enable quantized key-value (KV) cache support for the DeepSeek V4 model. The changes incorporate fixes from PRs #25247, #25303, and #25202, with some padding adjustments omitted.
- The implementation supports Q8_0 and Q4_0 quantization types for KV caches.
- Perplexity tests on WikiText-2 show small degradation relative to the f16 baseline: roughly 0.15% higher for Q8_0 and 0.67% higher for Q4_0.
- Final perplexity scores were 4.0242 for f16, 4.0304 for Q8_0, and 4.0512 for Q4_0.
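
To put the reported scores in proportion, the relative perplexity increase over the f16 baseline can be worked out directly. The sketch below uses only the three numbers quoted above; the names `ppl`, `relative_increase`, and `increase` are illustrative and do not come from llama.cpp.

```python
# Final perplexity scores as reported in the summary (WikiText-2).
ppl = {"f16": 4.0242, "q8_0": 4.0304, "q4_0": 4.0512}


def relative_increase(baseline: float, value: float) -> float:
    """Return how much `value` exceeds `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


# Percentage increase of each quantized cache type over the f16 baseline.
increase = {
    name: relative_increase(ppl["f16"], score)
    for name, score in ppl.items()
    if name != "f16"
}

for name, pct in increase.items():
    print(f"{name}: +{pct:.2f}% perplexity vs f16")
```

Lower perplexity is better, so both quantized cache types cost well under one percent on this benchmark, with Q4_0 losing somewhat more than Q8_0.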
These updates let users run DeepSeek V4 with reduced memory usage via quantized KV caches while keeping perplexity close to the f16 baseline.
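
As a rough guide to the memory side of that trade-off, the sketch below compares storage cost per cached element using the standard ggml block layouts, where Q8_0 and Q4_0 each store 32 elements per block with one fp16 scale. The element count in the usage line is a hypothetical figure, not a DeepSeek V4 parameter, and the actual savings depend on how the model lays out its cache, which the summary does not describe.

```python
# Each ggml quantization block covers 32 elements.
BLOCK_ELEMS = 32

# Bytes needed to store one 32-element block in each format:
#   f16  : 32 values * 2 bytes
#   q8_0 : 2-byte fp16 scale + 32 int8 values
#   q4_0 : 2-byte fp16 scale + 32 4-bit values (16 bytes)
BYTES_PER_BLOCK = {"f16": 2 * BLOCK_ELEMS, "q8_0": 2 + 32, "q4_0": 2 + 16}


def bits_per_element(cache_type: str) -> float:
    """Average storage cost of one cached element, in bits."""
    return BYTES_PER_BLOCK[cache_type] * 8 / BLOCK_ELEMS


def cache_bytes(n_elements: int, cache_type: str) -> int:
    """Bytes needed for n_elements, which must fill whole blocks."""
    if n_elements % BLOCK_ELEMS != 0:
        raise ValueError("n_elements must be a multiple of the block size")
    return n_elements // BLOCK_ELEMS * BYTES_PER_BLOCK[cache_type]


# Hypothetical cache of 1024 elements, for illustration only.
n = 1024
for cache_type in BYTES_PER_BLOCK:
    ratio = cache_bytes(n, cache_type) / cache_bytes(n, "f16")
    print(f"{cache_type}: {bits_per_element(cache_type)} bits/element, "
          f"{ratio:.1%} of f16 size")
```

Under these assumptions Q8_0 needs a little over half the space of f16 and Q4_0 a little over a quarter. In upstream llama.cpp the cache types are normally selected with the `--cache-type-k` and `--cache-type-v` options; whether this branch uses the same options for DeepSeek V4 is not stated in the summary.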