Researchers introduce InfoKV, an entropy-aware framework that compresses key-value caches by combining token-level predictive uncertainty with attention scores to improve long-context reasoning.
- Introduces "Forward Influence" to measure how compressed tokens affect future contexts, revealing that high-uncertainty tokens influence distant contexts more than attention-selected ones.
- Integrates layer-wise representation evolution and entropy scores with attention scores during the reasoning process.
- Experiments on Llama-3.1, Llama-3.2, and DeepSeek-R1 show consistent performance gains over existing methods in both long prefilling and decoding scenarios.
This approach addresses a limitation of compression methods that rely solely on attention weights: by adding information-theoretic signals, it aims to make large language models both more efficient and more effective on long reasoning tasks.
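
To make the idea of mixing predictive uncertainty with attention concrete, here is a minimal sketch of entropy-aware cache selection. It is not InfoKV's actual scoring rule, which this summary does not specify. The function names, the min-max normalization, the weighted-sum combination, and the mixing weight `alpha` are all assumptions made for illustration, and the layer-wise representation signal is left out.

```python
import math


def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over one token's logits."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def min_max(values):
    """Rescale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def select_kv_positions(attn_scores, entropies, budget, alpha=0.5):
    """Return the sorted positions of the `budget` tokens to keep in the cache.

    Hypothetical scoring rule (an assumption, not the published method):
    combined = (1 - alpha) * normalized attention + alpha * normalized entropy.
    """
    if len(attn_scores) != len(entropies):
        raise ValueError("attn_scores and entropies must have the same length")
    a = min_max(attn_scores)
    h = min_max(entropies)
    combined = [(1 - alpha) * ai + alpha * hi for ai, hi in zip(a, h)]
    ranked = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
    return sorted(ranked[:budget])


# Toy inputs: accumulated attention per cached token, and the next-token
# logits the model produced at each of those positions.
attn = [0.40, 0.05, 0.30, 0.05, 0.20]
logits_per_token = [
    [10.0, 0.0, 0.0],  # very confident prediction, low entropy
    [1.0, 1.0, 1.0],   # uniform prediction, maximal entropy
    [5.0, 0.0, 0.0],
    [8.0, 0.0, 0.0],
    [2.0, 1.0, 0.0],
]
entropies = [token_entropy(l) for l in logits_per_token]

print(select_kv_positions(attn, entropies, budget=2, alpha=0.0))  # attention only: [0, 2]
print(select_kv_positions(attn, entropies, budget=2, alpha=1.0))  # entropy only: [1, 4]
```

With `alpha=0.0` the selection reduces to plain attention-based eviction, while larger values retain tokens the model was uncertain about even when they received little attention. That is the kind of token the Forward Influence analysis suggests matters for distant contexts.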