Kamera: Training-Free Position-Invariant Multimodal KV Cache for Efficient Reuse

The authors introduce Kamera, a training-free method for reusing multimodal key-value (KV) caches that addresses the loss of cross-chunk conditioning in naive prefix caching. Standard state-merge recovers direct readouts but does not preserve the diffuse, low-rank residue in deep layers that multi-hop reasoning depends on, and losing that residue halves accuracy. To repair this, Kamera stores a small, training-free low-rank conditioning patch alongside each position-free chunk. Together, the position-free chunk and its patch allow exact RoPE re-rotation and restoration of cross-chunk binding across MLA, GQA, and MHA attention mechanisms. The system supports cheap reorder, sliding-window survival, and recall operations without re-encoding evicted chunks. Experiments show that a rank-m patch recovers full task accuracy on cross-chunk-binding benchmarks such as MM-NIAH and two-page doc-QA. In a production SGLang kernel, the method reconstructs the re-prefill KV to within bf16 rounding across six backbones while requiring only a fraction of the original KV footprint.
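
The exact RoPE re-rotation mentioned above rests on a general property of rotary embeddings: rotations compose additively, so a key already rotated for one position can be moved to another by applying the rotation for the position difference, without touching the encoder. The following is a minimal sketch of that property only, assuming the interleaved-pair RoPE convention and a base of 10000. The names `rope_rotate` and `rerotate` are hypothetical and are not Kamera's or SGLang's API, and the low-rank conditioning patch is not modeled here.

```python
import math


def rope_rotate(vec, pos, base=10000.0):
    """Apply RoPE to an even-length vector at integer position `pos`.

    Assumes the interleaved convention: dimensions (0,1), (2,3), ... form
    2-D pairs, and pair j is rotated by the angle pos * base**(-2j/d).
    """
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("RoPE needs an even head dimension")
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # i = 2j, so this is base**(-2j/d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def rerotate(rotated_key, old_pos, new_pos):
    """Move an already-rotated key from old_pos to new_pos.

    Because 2-D rotations compose additively, rotating by the position
    difference equals rotating the raw key by new_pos directly.
    """
    return rope_rotate(rotated_key, new_pos - old_pos)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# A toy 8-dimensional key and query (illustrative values, not real data).
key = [0.3, -1.2, 0.7, 0.05, -0.4, 0.9, 1.1, -0.6]
query = [1.0, 0.2, -0.5, 0.8, 0.1, -0.3, 0.6, 0.4]

# Position-free storage: rotation by position 0 is the identity.
position_free = rope_rotate(key, 0)

# Place the chunk at position 17, then later move it to position 513.
placed = rope_rotate(position_free, 17)
moved = rerotate(placed, 17, 513)
direct = rope_rotate(key, 513)
max_err = max(abs(a - b) for a, b in zip(moved, direct))

# Attention scores depend only on the relative offset between query and key.
score_a = dot(rope_rotate(query, 40), rope_rotate(key, 17))
score_b = dot(rope_rotate(query, 540), rope_rotate(key, 517))

print(f"max re-rotation error: {max_err:.2e}")
print(f"score difference under a shared shift: {abs(score_a - score_b):.2e}")
```

In exact arithmetic the two routes agree exactly; in floating point they agree up to rounding, which is consistent with the abstract's "within bf16 rounding" framing. Re-rotation restores only the positional part of a cached key. The cross-chunk conditioning that a chunk would have picked up during a full prefill is a separate matter, which is what the low-rank patch is described as repairing.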