HyperQuant is a unified post-training quantization pipeline designed for the weights and KV cache of large language and diffusion transformers, combining Hadamard transforms with optimal lattice quantization. The method outperforms recent schemes like HIGGS, TurboQuant, and OCTOPUS across various bit rates while maintaining near-lossless quality.

  • HyperQuant combines per-tile Randomized Hadamard Transforms (RHT), which make weights and activations approximately Gaussian, with low-dimensional optimal lattice quantization (E8, D4, A2, or Z), lossless bit-stripping, and Rice coding.
  • It outperforms HIGGS on weights at 3 to 5 bits per scalar (bps) and beats TurboQuant and OCTOPUS on KV-cache quantization down to 1.7 bps.
  • The pipeline integrates with 8-bit and 4-bit Tensor-Core MMA paths; on post-RHT lattice output, int8 outperforms fp8.
  • In end-to-end tests on an H100 at 4 bps, HyperQuant compresses linear weights by approximately 3.9x and the KV cache by 3.79x, with no observable artifacts in video models such as LTX-2.
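
To make the first two stages of the list above concrete, here is a minimal sketch of a randomized Hadamard transform followed by nearest-point quantization onto the D4 lattice. It is an illustration under stated assumptions, not the authors' implementation: the function names, the 16-element tile, and the step size of 0.5 are all hypothetical, D4 is picked only because its nearest-point rule is short, and the bit-stripping and Rice-coding stages are omitted. It is plain Python for readability, not a GPU kernel.

```python
import math
import random


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    x = list(x)
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in x]


def rht(tile, signs):
    """Randomized Hadamard transform: random sign flips, then Hadamard."""
    return fwht([s * v for s, v in zip(signs, tile)])


def inverse_rht(y, signs):
    """The orthonormal Hadamard matrix is its own inverse; then undo the signs."""
    return [s * v for s, v in zip(signs, fwht(y))]


def nearest_d4(v):
    """Nearest point of D4 (integer 4-vectors whose coordinates sum to an even number)."""
    r = [math.floor(c + 0.5) for c in v]
    if sum(r) % 2 != 0:
        # Parity is wrong: re-round the worst coordinate in the other direction.
        k = max(range(4), key=lambda i: abs(v[i] - r[i]))
        r[k] += 1 if v[k] > r[k] else -1
    return r


def quantize_tile(tile, signs, step):
    """Hypothetical helper: RHT the tile, then snap each group of 4 values to step * D4."""
    y = rht(tile, signs)
    codes = []
    for i in range(0, len(y), 4):
        codes.extend(nearest_d4([c / step for c in y[i:i + 4]]))
    return codes


def dequantize_tile(codes, signs, step):
    """Hypothetical helper: rescale the integer codes and invert the RHT."""
    return inverse_rht([c * step for c in codes], signs)


# Assumed toy setup: one 16-element Gaussian tile and a fixed seed.
rng = random.Random(0)
tile = [rng.gauss(0.0, 1.0) for _ in range(16)]
signs = [rng.choice((-1.0, 1.0)) for _ in range(16)]
codes = quantize_tile(tile, signs, step=0.5)
recon = dequantize_tile(codes, signs, step=0.5)
sq_err = sum((a - b) ** 2 for a, b in zip(tile, recon))
```

Because the transform is orthonormal, the reconstruction error in the original domain equals the quantization error in the rotated domain. In a real pipeline the integer `codes` would then go to an entropy coder rather than be stored directly.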

This approach compresses large models efficiently without significant quality loss, and bias correction on the KV cache preserves attention semantics.