Researchers propose GRINQH, a weight-only post-training quantization framework that accelerates large language model decoding by unifying quantization and sparsification. The method uses activation magnitudes to dynamically assign weight channels to different precision levels, reducing the weight traffic that makes the decoding stage memory-bound.
- Uses activation magnitudes as a proxy for each weight channel's computational importance, which allows flexible average bit widths during decoding.
- Implements a hierarchical nested memory layout for multi-precision storage within a custom GPU kernel.
- Outperforms state-of-the-art fixed- and mixed-precision baselines at comparable 3- and 4-bit settings on Llama3 and Qwen3 models.
- Enables effective 2-bit generation while establishing a new Pareto frontier for the trade-off between generation quality and inference speed.
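
To make the channel-assignment idea concrete, the following is a minimal NumPy sketch, not the GRINQH kernel. It assumes three precision levels (4-bit, 2-bit, and skipped, which plays the role of sparsification), assumes the nested layout means a 2-bit weight is read as the top two bits of its stored 4-bit code, and uses fixed fractions to split the channels. All function names (`quantize_4bit`, `dequantize_nested`, `assign_bits`, `mixed_precision_matvec`) and parameters are hypothetical and chosen only for illustration.

```python
import numpy as np

def quantize_4bit(W):
    """Per-row asymmetric 4-bit quantization: returns codes, scale, row minimum."""
    w_min = W.min(axis=1, keepdims=True)
    w_max = W.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / 15.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero on constant rows
    codes = np.clip(np.round((W - w_min) / scale), 0, 15).astype(np.uint8)
    return codes, scale, w_min

def dequantize_nested(codes, scale, w_min, bits):
    """Assumed nested read: keep the top `bits` bits of each 4-bit code."""
    shift = 4 - bits
    truncated = (codes >> shift).astype(np.float64)
    # Map the truncated code to the midpoint of the 4-bit bucket range it covers.
    centered = truncated * (1 << shift) + ((1 << shift) - 1) / 2.0
    return centered * scale + w_min

def assign_bits(x, frac_4bit, frac_2bit):
    """Rank input channels by |activation|; the largest get 4 bits, the next 2, the rest 0."""
    n = x.size
    order = np.argsort(-np.abs(x), kind="stable")
    n4 = int(round(frac_4bit * n))
    n2 = int(round(frac_2bit * n))
    bits = np.zeros(n, dtype=np.int64)
    bits[order[:n4]] = 4
    bits[order[n4:n4 + n2]] = 2
    return bits

def mixed_precision_matvec(codes, scale, w_min, x, bits):
    """Compute y = W x, reading each input channel's weights at its assigned precision."""
    y = np.zeros(codes.shape[0])
    for b in (4, 2):  # channels assigned 0 bits are skipped entirely
        cols = np.flatnonzero(bits == b)
        if cols.size:
            y += dequantize_nested(codes[:, cols], scale, w_min, b) @ x[cols]
    return y

# Usage: four input channels, one of which dominates the activation vector.
x = np.array([0.1, -5.0, 2.0, 0.01])
bits = assign_bits(x, frac_4bit=0.25, frac_2bit=0.5)
print(bits, "average bit width:", bits.mean())
```

Because the assignment depends on the current activation vector, the average bit width can change from token to token without re-quantizing the weights; in this sketch only the stored 4-bit codes are ever read, at either full or truncated width.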