The authors propose CompressKV, a framework that compresses the key-value (KV) cache in large language models built on grouped-query attention (GQA). It identifies Semantic Retrieval Heads and uses them to decide which tokens to retain. The approach targets the performance degradation of existing heuristic eviction methods, which ignore the distinct functions of individual attention heads.

  • CompressKV identifies Semantic Retrieval Heads (SRHs), which attend to initial, final, and semantically important mid-context tokens, and uses them to select which KV pairs to retain.
  • The framework allocates cache budgets across layers based on offline estimates of layer-wise eviction error.
  • On LongBench question-answering tasks, CompressKV preserves over 97% of full-cache performance using only 3% of the KV cache.
  • On Needle-in-a-Haystack, it reaches 90% accuracy while storing only 0.7% of the KV cache.
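
The first two points can be illustrated with a small Python sketch. It is not the authors' implementation: the function names, the score layout, and the rule of splitting the budget in proportion to each layer's eviction error are assumptions made for illustration. The summary only states that SRHs drive token selection and that layer budgets depend on offline error estimates.

```python
from typing import List, Sequence


def select_tokens(srh_scores: Sequence[Sequence[float]], budget: int) -> List[int]:
    """Return the positions of the tokens to keep, in original order.

    srh_scores[h][t] is the attention weight that (hypothetical) retrieval
    head h assigns to token t. Scores are summed over heads, and the
    `budget` highest-scoring tokens are kept.
    """
    n_tokens = len(srh_scores[0])
    total = [sum(head[t] for head in srh_scores) for t in range(n_tokens)]
    ranked = sorted(range(n_tokens), key=lambda t: total[t], reverse=True)
    return sorted(ranked[:budget])


def allocate_budgets(
    layer_errors: Sequence[float], total_budget: int, min_per_layer: int = 1
) -> List[int]:
    """Split a total token budget across layers.

    Assumption of this sketch: after a per-layer floor, each layer gets a
    share proportional to its offline eviction-error estimate.
    """
    n_layers = len(layer_errors)
    spare = total_budget - n_layers * min_per_layer
    if spare < 0:
        raise ValueError("total_budget is too small for the per-layer floor")
    err_sum = sum(layer_errors)
    if err_sum <= 0:
        shares = [spare / n_layers] * n_layers
    else:
        shares = [spare * e / err_sum for e in layer_errors]
    budgets = [min_per_layer + int(s) for s in shares]
    # Hand out tokens lost to rounding down, largest fractional part first.
    leftover = total_budget - sum(budgets)
    order = sorted(
        range(n_layers), key=lambda i: shares[i] - int(shares[i]), reverse=True
    )
    for i in order[:leftover]:
        budgets[i] += 1
    return budgets


# Two hypothetical heads scoring five tokens; keep three of them.
scores = [
    [0.5, 0.1, 0.05, 0.3, 0.05],
    [0.4, 0.05, 0.05, 0.1, 0.4],
]
kept = select_tokens(scores, budget=3)
budgets = allocate_budgets([1.0, 3.0], total_budget=10)
```

In this toy input the first token, one mid-context token, and the last token receive the most attention, so they are the ones retained. The layer with three times the estimated error receives most of the budget.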

These results indicate a better resource-performance trade-off for long-context LLM inference, which makes deployment on resource-constrained hardware more practical.
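
To give a sense of the memory at stake, the sketch below applies the standard KV cache size formula (keys plus values, per layer, per KV head) to a hypothetical GQA configuration. The model dimensions, context length, and 16-bit storage are assumptions chosen for illustration, not figures from the paper. The 3% ratio is borrowed from the LongBench result above purely as an example.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    seq_len: int,
    bytes_per_elem: int = 2,
) -> int:
    """Size of a full KV cache for one sequence.

    The factor 2 accounts for storing both keys and values.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical GQA model: 32 layers, 8 KV heads, head dimension 128,
# a 32,768-token context, 16-bit (2-byte) cache entries.
full = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=32768)
retained = full * 0.03  # cache size if only 3% of entries are kept

print(f"full cache:   {full / 2**30:.2f} GiB")
print(f"3% retained:  {retained / 2**20:.2f} MiB")
```

Under these assumptions the full cache is 4 GiB per sequence, and keeping 3% of it leaves roughly 123 MiB. Actual savings depend on the model and on how the retention budget is counted.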