Recency/Frequency Adaptive KV Caching for LLM Serving
A new KV caching method for LLM serving dynamically divides cache space between recently used and frequently used KV blocks. On synthetic workloads it raises the KV cache hit rate by up to 10.8% and cuts time to first token (TTFT) by up to 12.6%. On real-world conversation workloads the corresponding gains are 2.1% and 2.0%.
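The summary does not spell out the allocation algorithm. As a rough illustration only, here is a sketch of the classic ARC (Adaptive Replacement Cache) policy, a well-known technique that splits a fixed capacity between a recency list and a frequency list and shifts that split using "ghost" IDs of recently evicted entries. It is not the method summarized above. The class name `AdaptiveKVBlockCache`, the `access` method, and the use of hashable KV block IDs are hypothetical, and no KV tensors are actually stored.

```python
from collections import OrderedDict
from collections.abc import Hashable


class AdaptiveKVBlockCache:
    """ARC-style cache over hypothetical KV block IDs (illustrative sketch only).

    t1: resident blocks seen once recently (recency side)
    t2: resident blocks seen at least twice (frequency side)
    b1/b2: ghost IDs recently evicted from t1/t2 (no KV data kept)
    p: adaptive target size for t1; the remaining capacity goes to t2
    """

    def __init__(self, capacity: int) -> None:
        self.c = capacity
        self.p = 0.0
        # Each OrderedDict is an LRU list: first key = least recently used.
        self.t1: OrderedDict = OrderedDict()
        self.t2: OrderedDict = OrderedDict()
        self.b1: OrderedDict = OrderedDict()
        self.b2: OrderedDict = OrderedDict()
        self.hits = 0
        self.misses = 0

    def _replace(self, key: Hashable) -> None:
        # Evict t1's LRU block if t1 is over its target p (or at it, when the
        # miss hit a t2 ghost); otherwise evict t2's LRU block. Keep the ghost.
        if self.t1 and (len(self.t1) > self.p or (key in self.b2 and len(self.t1) == self.p)):
            victim, _ = self.t1.popitem(last=False)
            self.b1[victim] = None
        else:
            victim, _ = self.t2.popitem(last=False)
            self.b2[victim] = None

    def access(self, key: Hashable) -> bool:
        """Look up one KV block; return True on a cache hit."""
        if key in self.t1 or key in self.t2:
            # Hit: move the block to the MRU end of the frequency list.
            self.t1.pop(key, None)
            self.t2.pop(key, None)
            self.t2[key] = None
            self.hits += 1
            return True

        self.misses += 1
        if key in self.b1:
            # Ghost hit on the recency side: grow t1's share of the cache.
            self.p = min(self.c, self.p + max(len(self.b2) / len(self.b1), 1))
            self._replace(key)
            del self.b1[key]
            self.t2[key] = None
        elif key in self.b2:
            # Ghost hit on the frequency side: shrink t1's share of the cache.
            self.p = max(0.0, self.p - max(len(self.b1) / len(self.b2), 1))
            self._replace(key)
            del self.b2[key]
            self.t2[key] = None
        else:
            # Brand-new block: trim ghost history if needed, make room, then
            # insert it on the recency side.
            if len(self.t1) + len(self.b1) == self.c:
                if len(self.t1) < self.c:
                    self.b1.popitem(last=False)
                    self._replace(key)
                else:
                    # t1 fills the whole budget: drop its LRU block, no ghost.
                    self.t1.popitem(last=False)
            else:
                total = len(self.t1) + len(self.t2) + len(self.b1) + len(self.b2)
                if total >= self.c:
                    if total == 2 * self.c:
                        self.b2.popitem(last=False)
                    self._replace(key)
            self.t1[key] = None
        return False
```

The hit rate is `hits / (hits + misses)`. A ghost hit in `b1` suggests recency was under-provisioned and raises `p`, while a ghost hit in `b2` lowers it, so the recency/frequency split follows the workload without a hand-tuned ratio.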