Researchers present CAT-Q, a post-training quantization scheme that compresses large language models to ternary precision without costly quantization-aware training. The method combines learnable modulation with softened ternarization to achieve high accuracy using only 512 calibration samples.

  • CAT-Q employs learnable modulation to adjust weight distributions and thresholds, coupled with a differentiable transition function for stable convergence.
  • For models between 1.7B and 8B parameters, it outperforms the BitNet v1 and v2 families while using roughly 100,000 times fewer training tokens.
  • The approach also scales to larger models, from 14B to 235B parameters, turning them into leading ternary models in 8 to 60 hours on eight A100 GPUs.
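
To make the idea of softened ternarization concrete, here is a minimal sketch in plain Python. It is not the CAT-Q implementation: the summary does not specify the transition function or the form of the modulation, so the tanh-based surrogate, the affine `modulate` step, and the names `gain`, `shift`, `threshold`, and `sharpness` are all assumptions chosen for illustration. The sketch only shows how a smooth function can approximate a hard three-level quantizer while remaining differentiable.

```python
import math


def modulate(w, gain, shift):
    """Hypothetical affine modulation of a weight before quantization.

    Assumption: in a real scheme `gain` and `shift` would be learned
    on calibration data; here they are plain numbers.
    """
    return gain * w + shift


def hard_ternary(w, threshold):
    """Map a weight to -1, 0, or +1 using a symmetric threshold."""
    if w > threshold:
        return 1.0
    if w < -threshold:
        return -1.0
    return 0.0


def soft_ternary(w, threshold, sharpness):
    """Smooth surrogate for hard_ternary (an illustrative choice).

    The sum of two tanh steps centred at -threshold and +threshold
    approaches the hard staircase as `sharpness` grows, but has a
    nonzero gradient everywhere.
    """
    return 0.5 * (
        math.tanh(sharpness * (w - threshold))
        + math.tanh(sharpness * (w + threshold))
    )


if __name__ == "__main__":
    weights = [-0.9, -0.2, 0.0, 0.3, 0.8]
    for sharpness in (2.0, 10.0, 100.0):
        soft = [
            round(soft_ternary(modulate(w, 1.0, 0.0), 0.5, sharpness), 3)
            for w in weights
        ]
        print(sharpness, soft)
    print("hard", [hard_ternary(w, 0.5) for w in weights])
```

Raising `sharpness` gradually during calibration is one common way to move from the smooth surrogate toward exact ternary values; whether CAT-Q uses such a schedule is not stated in this summary.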

This method enables efficient compression and acceleration of diverse LLM architectures by significantly lowering the computational resources needed for quantization.