Ternary Mamba: Pretrained QAT for Efficient SSM Compression
Ternary Mamba compresses Mamba-2 by 3.61x, reducing memory from 2,687 MB to 744 MB, using grouped quantization-aware training (QAT) initialized from a pretrained checkpoint. With only 102M tokens and 4 GPU-hours, it reaches 48.1% zero-shot accuracy, within 0.9 percentage points of Bi-Mamba. It also reveals previously unreported instabilities arising from learnable quantization scales and from error accumulation in the recurrence.
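
To make the idea of grouped ternary quantization concrete, the following is a minimal NumPy sketch of a forward-pass quantizer. It is an illustration under stated assumptions, not the method used in this work: the abstract does not specify the quantizer, so the absmean scale per group, the group size of 128, and the function name `ternary_quantize_grouped` are all hypothetical choices made here. The scales are computed from the weights rather than learned.

```python
import numpy as np

def ternary_quantize_grouped(w, group_size=128, eps=1e-8):
    """Map weights to codes in {-1, 0, +1} with one scale per group.

    Assumption (not from the source): each scale is the mean absolute
    value of its group, and w.size is divisible by group_size.
    """
    flat = w.reshape(-1, group_size)                         # one row per group
    scale = np.abs(flat).mean(axis=1, keepdims=True) + eps   # per-group scale
    codes = np.clip(np.round(flat / scale), -1, 1)           # ternary codes
    w_hat = (codes * scale).reshape(w.shape)                 # dequantized weights
    return codes.reshape(w.shape).astype(np.int8), scale.squeeze(1), w_hat

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 256)).astype(np.float32)
codes, scales, w_hat = ternary_quantize_grouped(w, group_size=128)

# Compression ratio implied by the memory figures reported in the abstract.
compression = 2687 / 744
```

In a QAT setup, the forward pass would typically use `w_hat` while gradients are passed to the full-precision weights through a straight-through estimator; whether this work does exactly that is not stated in the abstract. The last line simply checks that the reported memory figures are consistent with the stated 3.61x ratio.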