A user tested the Unsloth quantized version of GLM 5.2 on a high-end consumer workstation with dual RTX 5090 GPUs and a Zen 5 Threadripper Pro processor. The system had 512 GB of DDR5 ECC RAM, and llama.cpp was compiled with flags that enable CUDA optimizations and unified memory handling. The model weights were loaded from the UD-Q5_K_S quantization, which totals approximately 492 GB across multiple GGUF files. For the benchmark, the user ran llama-server with a context size of 32768 tokens and threading parameters set for NUMA isolation. Inference speed held steady at 12 tokens per second during plain chat interactions, with no agentic workflows involved. In further experiments, omitting individual optimization flags, such as flash attention or the NUMA settings, produced negligible changes in throughput.
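To put the reported numbers in perspective, here is a rough back-of-the-envelope sketch in Python. It uses the figures from the summary (492 GB of weights, 512 GB of RAM, 12 tokens per second) and one assumption that the post does not state: 32 GB of VRAM per RTX 5090. The function names are illustrative only, and the calculation ignores the KV cache, the operating system, and any other memory consumers, so treat the result as an upper bound on headroom rather than a measurement.

```python
# Figures taken from the summary.
WEIGHTS_GB = 492         # UD-Q5_K_S GGUF files, approximate total
SYSTEM_RAM_GB = 512      # DDR5 ECC
TOKENS_PER_SECOND = 12   # reported chat throughput

# Assumption (not stated in the post): 32 GB of VRAM per RTX 5090.
VRAM_PER_GPU_GB = 32
GPU_COUNT = 2


def memory_headroom_gb(weights_gb, ram_gb, vram_per_gpu_gb, gpu_count):
    """Combined RAM + VRAM left over once the weights are resident.

    Ignores KV cache, compute buffers, and OS overhead.
    """
    return ram_gb + vram_per_gpu_gb * gpu_count - weights_gb


def generation_seconds(num_tokens, tokens_per_second):
    """Wall-clock time to generate num_tokens at a steady rate."""
    return num_tokens / tokens_per_second


headroom = memory_headroom_gb(
    WEIGHTS_GB, SYSTEM_RAM_GB, VRAM_PER_GPU_GB, GPU_COUNT
)
ram_only_headroom = SYSTEM_RAM_GB - WEIGHTS_GB
seconds_for_1000 = generation_seconds(1000, TOKENS_PER_SECOND)

print(f"Headroom with RAM + VRAM: {headroom} GB")
print(f"Headroom with RAM alone:  {ram_only_headroom} GB")
print(f"Time for a 1000-token reply: {seconds_for_1000:.1f} s")
```

Under these assumptions the weights leave about 84 GB free across RAM and VRAM combined, but only about 20 GB if they had to fit in system RAM alone. At 12 tokens per second, a 1000-token reply takes a little under a minute and a half.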
GLM 5.2 runs at 12 t/s on dual RTX 5090 hardware