Researchers present CAT-Q, a post-training quantization scheme that compresses large language models to ternary precision without costly quantization-aware training. The method uses learnable modulation and softened ternarization to achieve high accuracy with only 512 calibration samples.
- CAT-Q employs learnable modulation to adjust weight distributions and thresholds, coupled with a differentiable transition function for stable convergence.
- For models between 1.7B and 8B parameters, it outperforms the BitNet v1 and v2 families while using approximately 100,000 times fewer training tokens.
- The approach also scales to larger models, from 14B to 235B parameters, quantizing them into leading ternary models within 8 to 60 hours on eight A100 GPUs.
This method enables efficient compression and acceleration of diverse LLM architectures by significantly lowering the computational resources needed for quantization.
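
To make the idea of softened ternarization concrete, here is a minimal sketch of how a hard three-level quantizer can be replaced by a differentiable transition. The summary does not give CAT-Q's actual transition function or the form of its modulation, so everything here is an assumption for illustration: the tanh-based soft step, the `temperature` parameter, and the affine `modulate` helper with hypothetical parameters `gamma` and `beta`.

```python
import math

def modulate(w, gamma, beta):
    # Hypothetical affine modulation of a weight; in a real setup gamma and
    # beta would be learned on calibration data. Not CAT-Q's stated form.
    return gamma * w + beta

def hard_ternarize(w, threshold):
    # Standard ternarization: map a weight to -1, 0, or +1 by threshold.
    if w > threshold:
        return 1.0
    if w < -threshold:
        return -1.0
    return 0.0

def soft_ternarize(w, threshold, temperature):
    # Assumed smooth surrogate: the average of two tanh steps centred at
    # +threshold and -threshold. It is differentiable everywhere and
    # approaches hard_ternarize as temperature shrinks toward zero.
    up = math.tanh((w - threshold) / temperature)
    down = math.tanh((w + threshold) / temperature)
    return 0.5 * (up + down)

weights = [-0.9, -0.05, 0.0, 0.04, 0.8]
gamma, beta, threshold = 1.0, 0.0, 0.3

hard = [hard_ternarize(modulate(w, gamma, beta), threshold) for w in weights]
soft = [soft_ternarize(modulate(w, gamma, beta), threshold, 0.01) for w in weights]

for w, h, s in zip(weights, hard, soft):
    print(f"w={w:+.2f}  hard={h:+.0f}  soft={s:+.4f}")
```

With a small `temperature` the soft values sit very close to the hard ternary levels, while a larger `temperature` gives a gentler slope near the thresholds, so gradients can reach the modulation parameters and the threshold during calibration.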