UltraQuant: 4-bit KV Caching for Context-Heavy Agents
UltraQuant introduces a 4-bit KV caching method tailored for context-heavy agent workloads. It achieves 3.47x reduction in P50 time-to-first-token in late rounds and 1.63x higher output throughput compared to FP8 KV caching, using FP8 queries, FP4 KV tensors, and native AMD CDNA4 scaled-MFMA support.