Token Importance Scoring (TIS) is a new mechanism that applies constraint-aware learning to identify and retain the most important tokens, enabling efficient KV cache compression in large language models. It relies on hard anchor forcing to block trivial optimization paths, so that gradient descent can learn meaningful token importance.
- Achieves 100% accuracy on the NIAH (needle-in-a-haystack) synthetic retrieval task with a learned model at a 50% cache budget.
- Reaches 52.8% on the LITM (lost-in-the-middle) semantic QA benchmark at a 50% budget, without query-specific training.
- Three checkpoints are available, including a main model (tis-stage3-ert) and an extreme compression variant (tis-v8b-hard-anchor).
- Validated on consumer hardware: an RTX 5070 with 8GB VRAM running Mistral-7B-v0.3.
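
To give a rough sense of what a 50% cache budget means in memory terms, the arithmetic below estimates KV cache size for a Mistral-7B-class model. The layer count, KV head count, and head dimension are taken from the publicly documented Mistral-7B configuration; the 16-bit cache precision and the 32,768-token sequence length are assumptions chosen for illustration, not figures reported for TIS.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128,
                   bytes_per_elem=2):
    """Estimate KV cache size in bytes for one sequence.

    Defaults assume a Mistral-7B-style layout (grouped-query attention
    with 8 KV heads) and a 16-bit cache; both are illustrative assumptions.
    """
    # Factor 2: each layer stores one key tensor and one value tensor.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len


seq_len = 32_768  # hypothetical context length, chosen for illustration
full = kv_cache_bytes(seq_len)
half = kv_cache_bytes(seq_len // 2)  # 50% budget keeps half the positions

print(f"per token: {kv_cache_bytes(1) / 2**10:.0f} KiB")
print(f"full cache: {full / 2**30:.1f} GiB, 50% budget: {half / 2**30:.1f} GiB")
```

Under these assumptions the cache costs 128 KiB per token, so halving the retained positions at a 32,768-token context saves about 2 GiB, which is a meaningful share of an 8GB card once model weights are accounted for.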
Overall, TIS shows that learned importance scores can match oracle performance on structural tasks while remaining feasible on consumer GPUs.
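
The idea of keeping a fixed fraction of tokens by importance, with some positions always retained, can be sketched as below. This is a minimal illustration and not the TIS implementation: the scores are supplied as a plain array rather than produced by a learned model, `select_tokens` and its parameters are hypothetical names, and anchors are treated here as forced-keep positions at selection time, which is an assumption about how hard anchors might carry over to inference.

```python
import numpy as np


def select_tokens(scores, budget=0.5, anchors=()):
    """Return sorted indices of cache positions to keep.

    scores  : 1-D array of per-token importance scores (higher = keep).
    budget  : fraction of positions to retain.
    anchors : positions that are always kept, regardless of score
              (hypothetical stand-in for hard anchors).
    """
    scores = np.asarray(scores, dtype=float)
    n = scores.shape[0]
    keep_count = max(1, int(n * budget))

    forced = np.unique(np.asarray(anchors, dtype=int))
    if forced.size > keep_count:
        raise ValueError("more anchors than the budget allows")

    # Give anchors an infinite score so they always win the top-k selection.
    boosted = scores.copy()
    boosted[forced] = np.inf

    # Stable sort on negated scores: ties break toward earlier positions.
    order = np.argsort(-boosted, kind="stable")
    return np.sort(order[:keep_count])


scores = [0.1, 0.9, 0.2, 0.8, 0.05, 0.7, 0.3, 0.6]
kept = select_tokens(scores, budget=0.5, anchors=(0,))
print(kept.tolist())  # [0, 1, 3, 5]

# Applying the selection to a dummy key tensor of shape (heads, seq, dim).
keys = np.zeros((8, len(scores), 128))
compressed_keys = keys[:, kept, :]
print(compressed_keys.shape)  # (8, 4, 128)
```

Position 0 has a low score but survives because it is an anchor; the remaining three slots go to the highest-scoring tokens. Returning the indices in sorted order keeps the retained cache entries in their original sequence order.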