A user reported over 50 tokens per second running GLM5.2 on a GH200 system by combining the MTP (multi-token prediction) head from zai's FP8 repo with CyanKiwi's AWQ-INT4 quantized model. The hybrid checkpoint, built with a merge script and served through a patched vLLM, reached a best case of ~55 tok/s at 4x concurrency and ~45 tok/s for a single inference stream, while streaming from RAM to VRAM.
Model hacks boost GLM5.2 speed from 2.5 to over 50 tok/s
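
The user's merge script is not reproduced here. As a rough illustration of what such a merge involves, the sketch below grafts one repo's MTP tensors onto another repo's `weight_map`, the tensor-name-to-shard table stored in a Hugging Face `model.safetensors.index.json`. The tensor names, the `model.layers.99.` prefix, and the `mtp-` shard renaming are all hypothetical placeholders, not the layout of either real repo.

```python
def merge_weight_maps(base_index, donor_index, donor_prefix, shard_tag="mtp-"):
    """Return a new index: every base tensor, plus donor tensors under donor_prefix.

    Donor shard filenames get shard_tag prepended (a hypothetical convention)
    so they cannot collide with identically named shards in the base repo.
    metadata is copied as-is; any total_size in it is NOT recomputed here.
    """
    merged = dict(base_index["weight_map"])
    for name, shard in donor_index["weight_map"].items():
        if name.startswith(donor_prefix):
            merged[name] = shard_tag + shard
    return {"metadata": dict(base_index.get("metadata", {})), "weight_map": merged}


# Toy indexes with placeholder tensor names (not the real checkpoints).
awq_index = {
    "metadata": {},
    "weight_map": {"model.layers.0.mlp.qweight": "model-00001-of-00002.safetensors"},
}
fp8_index = {
    "weight_map": {
        "model.layers.0.mlp.weight": "model-00001-of-00009.safetensors",
        "model.layers.99.mtp_proj.weight": "model-00009-of-00009.safetensors",
    },
}

merged_index = merge_weight_maps(awq_index, fp8_index, "model.layers.99.")
```

Only the index is handled here. A real merge would also have to copy the renamed shard files and make the serving stack load layers in mixed quantization formats, which is presumably what the vLLM patch addresses.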