CAT-Q: Cost-efficient and Accurate Ternary Quantization for LLMs

Researchers present CAT-Q, a post-training quantization scheme that compresses large language models to ternary precision without costly quantization-aware training. The method combines learnable modulation with softened ternarization and reaches high accuracy with only 512 calibration samples.

CAT-Q uses learnable modulation to adjust weight distributions and ternarization thresholds, paired with a differentiable transition function that softens the ternarization step for stable convergence.
For models between 1.7B and 8B parameters, CAT-Q outperforms the BitNet v1 and v2 families while requiring roughly 100,000 times fewer training tokens.
The approach also scales up: it quantizes models from 14B to 235B parameters into leading ternary models in 8 to 60 hours on eight A100 GPUs.
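
To make the idea of a differentiable transition function concrete, here is a minimal NumPy sketch. It is not CAT-Q's actual formulation, which this summary does not spell out. The tanh-based surrogate, the names `soft_ternarize` and `modulated_soft_ternarize`, and the per-row `scale` and `threshold` parameters are all illustrative assumptions standing in for the learnable modulation and softened ternarization.

```python
import numpy as np

def hard_ternarize(w, threshold):
    """Map each weight to -1, 0, or +1; weights with |w| <= threshold become 0."""
    return np.sign(w) * (np.abs(w) > threshold)

def soft_ternarize(w, threshold, sharpness):
    """Smooth surrogate (assumed form): the sum of two tanh steps at +/-threshold.

    It is differentiable everywhere and approaches hard_ternarize
    as sharpness grows.
    """
    return 0.5 * (np.tanh(sharpness * (w - threshold))
                  + np.tanh(sharpness * (w + threshold)))

def modulated_soft_ternarize(w, scale, threshold, sharpness):
    """Apply hypothetical per-row parameters before the soft step.

    scale and threshold would be trained on calibration data in a real
    pipeline; here they are plain arrays with one entry per weight row.
    """
    return soft_ternarize(w * scale[:, None], threshold[:, None], sharpness)

# Toy weights: one row, with values on both sides of the threshold.
w = np.array([[-1.2, -0.3, 0.0, 0.4, 0.9]])
scale = np.ones(1)
threshold = np.full(1, 0.5)

smooth = modulated_soft_ternarize(w, scale, threshold, sharpness=1.0)
sharp = modulated_soft_ternarize(w, scale, threshold, sharpness=50.0)
hard = hard_ternarize(w * scale[:, None], threshold[:, None])
```

At low sharpness the output varies smoothly with each weight, so gradients can reach `scale` and `threshold`. Raising the sharpness moves the output toward the hard ternary values, which is one common way to anneal a soft quantizer toward its discrete target.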

By sharply lowering the compute needed for quantization, CAT-Q makes ternary compression and acceleration practical across diverse LLM architectures.
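
For a rough sense of what ternary compression means for weight storage, the sketch below compares bits per weight. It assumes a 16-bit baseline, which the summary does not state. It counts weight storage only and ignores scaling factors, embeddings, and any layers left unquantized. The helper name `weight_footprint_gb` is hypothetical.

```python
import math

def weight_footprint_gb(num_params, bits_per_weight):
    """Storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

# A three-valued weight carries at most log2(3) bits of information.
TERNARY_IDEAL_BITS = math.log2(3)

# Model sizes taken from the ranges quoted above.
footprints = {
    num_params: {
        "16-bit baseline (assumed)": weight_footprint_gb(num_params, 16),
        "2-bit packed ternary": weight_footprint_gb(num_params, 2),
        "ideal ternary": weight_footprint_gb(num_params, TERNARY_IDEAL_BITS),
    }
    for num_params in (8e9, 235e9)
}

for num_params, row in footprints.items():
    for label, gigabytes in row.items():
        print(f"{num_params:.0e} params, {label}: {gigabytes:.2f} GB")
```

Under these assumptions an 8B-parameter model drops from 16 GB of weights to 2 GB with simple 2-bit packing, or about 1.58 GB at the information-theoretic limit for ternary values.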