High-quality GLM-5.2 Quant on 4x DGX Spark - Guide, Results, and Comps

The author demonstrates running the GLM-5.2 NVFP4 model on four NVIDIA GB10 DGX Spark nodes with a 128K context window, achieving usable serving performance through aggressive system optimization.

The model uses NVFP4 quantization for MoE expert FFNs while keeping attention and router in BF16, reducing the checkpoint size from 1.5 TB to 410 GB.
Throughput reaches approximately 14.5-15.2 tok/s on short-prompt code generation and holds at about 13 tok/s at long context lengths (32K-112K).
The setup requires a custom vLLM fork with decode-context parallelism (DCP) and B12X sparse MLA patches, alongside a heavily pruned Ray configuration to fit within the unified memory constraints.
A BF16 KV cache at 128K context did not fit with sufficient headroom, so the setup uses an FP8 KV cache (fp8_kv_cache) and disables specific OS services to free memory.
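
The memory pressure described above can be illustrated with rough arithmetic. The sketch below assumes the 410 GB checkpoint is split evenly across the four nodes and that each DGX Spark has 128 GB of unified memory. Both are assumptions for illustration: the even split is optimistic if any BF16 tensors are replicated rather than sharded, and the function name is hypothetical.

```python
# Rough per-node memory budget: a sketch under stated assumptions, not a measurement.
CHECKPOINT_GB = 410    # NVFP4 checkpoint size quoted in the summary
NUM_NODES = 4          # four DGX Spark nodes
NODE_MEMORY_GB = 128   # assumption: unified memory per DGX Spark node


def per_node_headroom_gb(checkpoint_gb, num_nodes, node_memory_gb):
    """Memory left on each node after an (assumed) even split of the weights."""
    weights_per_node = checkpoint_gb / num_nodes
    return node_memory_gb - weights_per_node


if __name__ == "__main__":
    headroom = per_node_headroom_gb(CHECKPOINT_GB, NUM_NODES, NODE_MEMORY_GB)
    print(f"weights per node:  {CHECKPOINT_GB / NUM_NODES:.1f} GB")  # 102.5 GB
    print(f"headroom per node: {headroom:.1f} GB")                   # 25.5 GB
```

Under these assumptions roughly 25 GB per node remains for the KV cache, activations, the serving stack, and the OS, which is consistent with the need for a smaller KV cache and a trimmed system configuration.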

The guide offers a viable path for deploying a model of this scale on Spark hardware by combining decode-context parallelism (DCP) with significant memory trimming, though the author notes it is a niche setup unsuitable for batch serving.
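
To show why the KV cache element width and context sharding both matter, here is a minimal sizing sketch for an MLA-style cache, where each token stores one compressed latent vector per layer. The layer count and latent width below are placeholders, not GLM-5.2's real dimensions, and the formula ignores scaling factors, any sparse-attention index data, and allocator overhead. It also assumes the cache is divided evenly across DCP ranks.

```python
# KV cache sizing sketch for an MLA-style cache. All model dimensions are
# hypothetical placeholders; only the relative comparisons are meaningful.

def kv_cache_gb_per_rank(context_tokens, num_layers, latent_dim,
                         bytes_per_elem, dcp_size):
    """Approximate KV cache size per rank, in GB (10^9 bytes).

    Assumes one latent vector of `latent_dim` elements per token per layer,
    and an even split of the context across `dcp_size` ranks.
    """
    total_bytes = context_tokens * num_layers * latent_dim * bytes_per_elem
    return total_bytes / dcp_size / 1e9


if __name__ == "__main__":
    tokens = 128 * 1024   # assumption: "128K" means 131072 tokens
    layers = 64           # hypothetical layer count
    latent = 576          # hypothetical latent width per token per layer

    bf16 = kv_cache_gb_per_rank(tokens, layers, latent, bytes_per_elem=2, dcp_size=1)
    fp8 = kv_cache_gb_per_rank(tokens, layers, latent, bytes_per_elem=1, dcp_size=1)
    fp8_dcp4 = kv_cache_gb_per_rank(tokens, layers, latent, bytes_per_elem=1, dcp_size=4)

    print(f"fp8 / bf16 size ratio:       {fp8 / bf16:.2f}")      # 0.50
    print(f"dcp=4 / dcp=1 size per rank: {fp8_dcp4 / fp8:.2f}")  # 0.25
```

The two levers multiply: halving the element width and splitting the context over four ranks cuts the per-rank cache to one eighth of the unsharded BF16 figure in this simplified model.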

Benchmarks