CompressKV: Semantic-Retrieval-Guided KV-Cache Compression for Resource-Efficient Long-Context LLM Inference
The authors propose CompressKV, a framework that compresses the key-value (KV) cache in large language models that use grouped-query attention (GQA). It identifies semantic retrieval heads and uses their attention to decide which critical tokens to retain. This addresses the performance degradation of existing heuristic eviction methods, which ignore the distinct functions of individual attention heads.
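To make the general idea concrete, the following is a minimal sketch of head-guided token retention, not the authors' implementation. It assumes that a set of retrieval heads has already been identified by some other means, and that attention weights from the last few queries (an observation window) are available. The function names `select_tokens` and `compress_kv`, the parameters `head_ids`, `budget`, and `window`, and the mean-pooling scoring rule are all illustrative assumptions rather than details taken from the summary above.

```python
import numpy as np


def select_tokens(attn, head_ids, budget, window):
    """Pick which token positions to keep in the KV cache.

    attn:     (num_heads, window, seq_len) attention weights from the last
              `window` queries (hypothetical input layout).
    head_ids: indices of the heads assumed to be retrieval heads.
    budget:   total number of tokens to keep, including the recent window.
    window:   number of most recent tokens that are always kept.
    """
    _, _, seq_len = attn.shape
    if budget >= seq_len:
        return np.arange(seq_len)
    if budget < window:
        raise ValueError("budget must be at least as large as window")

    n_prefix = seq_len - window
    # Score each prefix token by its mean attention over the chosen heads
    # and over the queries in the observation window (assumed scoring rule).
    scores = attn[head_ids, :, :n_prefix].mean(axis=(0, 1))
    # Keep the highest-scoring prefix tokens; stable sort breaks ties by position.
    top = np.argsort(-scores, kind="stable")[: budget - window]
    recent = np.arange(n_prefix, seq_len)
    return np.sort(np.concatenate([top, recent]))


def compress_kv(keys, values, keep):
    """Gather the retained positions from (num_kv_heads, seq_len, head_dim) caches."""
    return keys[:, keep, :], values[:, keep, :]
```

In this sketch only the selected heads vote on which tokens survive, so a token that attracts attention solely from other heads is evicted. Under GQA, several query heads share one KV head, which is why the same `keep` indices are applied across all KV heads here; a per-group selection would be an equally plausible design.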