User fairydreaming has merged pull requests into their llama.cpp branch to enable quantized key-value (KV) cache support for the DeepSeek V4 model. The branch incorporates fixes from PRs #25247, #25303, and #25202, although some padding adjustments were left out.

  • The implementation supports Q8_0 and Q4_0 quantization types for KV caches.
  • Perplexity tests on WikiText-2 show minimal degradation compared to the f16 baseline: roughly +0.15% for Q8_0 and +0.67% for Q4_0.
  • Final perplexity values were 4.0242 for f16, 4.0304 for Q8_0, and 4.0512 for Q4_0.
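
To put "minimal degradation" in concrete terms, the reported perplexity values can be turned into absolute and relative differences against the f16 baseline. The following is a small sketch that uses only the three numbers listed above; the helper name `relative_increase` is made up for illustration.

```python
# Final WikiText-2 perplexity values as reported in the list above.
ppl = {"f16": 4.0242, "q8_0": 4.0304, "q4_0": 4.0512}


def relative_increase(baseline: float, value: float) -> float:
    """Return the percentage increase of `value` over `baseline`."""
    return (value - baseline) / baseline * 100.0


# Percentage increase in perplexity for each quantized cache type.
deltas = {
    name: relative_increase(ppl["f16"], value)
    for name, value in ppl.items()
    if name != "f16"
}

for name, pct in deltas.items():
    print(f"{name}: +{ppl[name] - ppl['f16']:.4f} PPL ({pct:.2f}%)")
# q8_0: +0.0062 PPL (0.15%)
# q4_0: +0.0270 PPL (0.67%)
```

Both increases are well under 1%, with Q4_0 costing roughly four times as much perplexity as Q8_0.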

These updates allow users to run DeepSeek V4 with reduced memory usage via quantized caches while keeping perplexity close to that of the f16 cache baseline.
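
As a rough guide to the memory reduction, the sketch below compares storage cost per cached element for the three cache types. It assumes the standard ggml block layouts (32 elements per block, with one 16-bit scale per block for Q8_0 and Q4_0) and that the whole cache is stored in the chosen type. The summary does not say whether this branch keeps any cache tensors in higher precision, so real savings may be smaller.

```python
# Assumed ggml block layouts: 32 elements per block.
BLOCK_SIZE = 32
BYTES_PER_BLOCK = {
    "f16": BLOCK_SIZE * 2,        # 32 half-precision values, no scale
    "q8_0": BLOCK_SIZE * 1 + 2,   # 32 int8 values + one fp16 scale
    "q4_0": BLOCK_SIZE // 2 + 2,  # 32 4-bit values + one fp16 scale
}


def bits_per_element(cache_type: str) -> float:
    """Average storage cost of one cached element, in bits."""
    return BYTES_PER_BLOCK[cache_type] * 8 / BLOCK_SIZE


def size_ratio(cache_type: str) -> float:
    """Cache size relative to an f16 cache of the same shape."""
    return BYTES_PER_BLOCK[cache_type] / BYTES_PER_BLOCK["f16"]


for t in ("f16", "q8_0", "q4_0"):
    print(f"{t}: {bits_per_element(t):.1f} bits/element, "
          f"{size_ratio(t):.0%} of f16 size")
```

Under these assumptions a Q8_0 cache takes a little over half the space of an f16 cache, and a Q4_0 cache a little over a quarter.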