The author demonstrates running the GLM-5.2 NVFP4 model on four NVIDIA GB10 DGX Spark nodes with a 128K context window, achieving usable serving performance through aggressive system optimization.

  • The model uses NVFP4 quantization for the MoE expert FFNs while keeping the attention and router weights in BF16, which shrinks the checkpoint from 1.5 TB to 410 GB.
  • Throughput is roughly 14.5-15.2 tok/s on short-prompt code generation and holds at about 13 tok/s at long context lengths (32K-112K).
  • The setup requires a custom vLLM fork with DCP (decode-context parallelism) and B12X sparse MLA patches, plus a heavily pruned Ray configuration, to fit within the unified memory constraints.
  • A BF16 KV cache at 128K context did not fit with sufficient headroom, so the setup uses fp8_kv_cache and disables specific OS services to free memory.
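
The memory pressure behind these points can be made concrete with a rough back-of-the-envelope sketch. It assumes 128 GB of unified memory per Spark node and an even split of the 410 GB checkpoint across the four nodes; neither assumption is stated in the summary, and the function names are illustrative only.

```python
# Rough per-node memory budget. Assumptions (not from the source):
#  - each node has 128 GB of unified memory
#  - the checkpoint is sharded evenly across all nodes
NODES = 4
UNIFIED_GB_PER_NODE = 128   # assumed
CHECKPOINT_GB = 410         # NVFP4 checkpoint size quoted in the summary
ORIGINAL_GB = 1500          # 1.5 TB original checkpoint quoted in the summary


def per_node_budget(checkpoint_gb, nodes, unified_gb):
    """Return (weights per node, memory left per node), both in GB."""
    weights = checkpoint_gb / nodes
    return weights, unified_gb - weights


def kv_cache_ratio(new_bytes_per_elem, old_bytes_per_elem):
    """Relative KV cache size after changing the element width."""
    return new_bytes_per_elem / old_bytes_per_elem


weights_gb, left_gb = per_node_budget(CHECKPOINT_GB, NODES, UNIFIED_GB_PER_NODE)
print(f"compression vs. original: {ORIGINAL_GB / CHECKPOINT_GB:.2f}x")
print(f"weights per node: {weights_gb:.1f} GB, remaining: {left_gb:.1f} GB")
print(f"FP8 KV cache relative to BF16: {kv_cache_ratio(1, 2):.0%}")
```

Under these assumptions, only about a fifth of each node's memory is left for the KV cache, activations, the serving runtime, and the OS. That is consistent with the switch to an FP8 KV cache (half the bytes per element of BF16) and with trimming Ray and OS services.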

This guide provides a viable path for deploying large-scale models on Spark hardware by combining decode-context parallelism with significant memory trimming, though the author notes that it is a niche setup unsuitable for batch serving.
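
To put the quoted decode rates in perspective, the following sketch estimates how long a single response takes to generate. It is an illustration only: it ignores prompt prefill time, and the 1,000-token response length is an arbitrary example, not a figure from the source.

```python
def seconds_to_generate(num_tokens, tokens_per_second):
    """Estimated decode time in seconds, ignoring prefill and scheduling overhead."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


# Decode rates quoted in the summary; the response length is hypothetical.
RESPONSE_TOKENS = 1000
for label, rate in [("short prompt (low)", 14.5),
                    ("short prompt (high)", 15.2),
                    ("long context", 13.0)]:
    secs = seconds_to_generate(RESPONSE_TOKENS, rate)
    print(f"{label}: {secs:.1f} s for {RESPONSE_TOKENS} tokens")
```

At these rates a 1,000-token answer takes on the order of a minute, which is workable for one interactive user but helps explain why the setup is described as unsuitable for batch serving.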