A follow-up investigation into running GLM-5.2 NVFP4 on four DGX Spark nodes resolves an earlier performance bottleneck: high speculative-token acceptance rates could not be reached at 128K context.

The root cause was a bug in vLLM's `SpeculativeConfig.create_draft_parallel_config()`, which failed to copy `decode_context_parallel_size` into the draft configuration, so the draft layers ignored decode context parallel (DCP) sharding. As a result, draft attention treated each local cache fragment as if it were the full cache, and acceptance rates collapsed for MTP2 and MTP3.
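The failure mode can be illustrated with a toy sketch. This is not vLLM's code: `ToyParallelConfig` and both helper functions are hypothetical stand-ins, and the tensor-parallel value is an arbitrary illustrative choice. It only shows how a field that is not copied silently falls back to its default of 1, which is how a draft model can end up unaware that the cache is sharded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyParallelConfig:
    """Hypothetical stand-in for a parallel config; not vLLM's class."""
    tensor_parallel_size: int = 1
    decode_context_parallel_size: int = 1  # 1 means "cache is not sharded"


def create_draft_config_buggy(target: ToyParallelConfig) -> ToyParallelConfig:
    # Copies fields one by one and forgets the DCP size,
    # so the draft config silently keeps the default of 1.
    return ToyParallelConfig(
        tensor_parallel_size=target.tensor_parallel_size,
    )


def create_draft_config_fixed(target: ToyParallelConfig) -> ToyParallelConfig:
    # Same as above plus the one missing line that mirrors the target's DCP size.
    return ToyParallelConfig(
        tensor_parallel_size=target.tensor_parallel_size,
        decode_context_parallel_size=target.decode_context_parallel_size,
    )


# Illustrative values only: DCP4 as in the text, TP size chosen arbitrarily.
target = ToyParallelConfig(tensor_parallel_size=4, decode_context_parallel_size=4)
buggy = create_draft_config_buggy(target)
fixed = create_draft_config_fixed(target)

print("buggy draft DCP size:", buggy.decode_context_parallel_size)
print("fixed draft DCP size:", fixed.decode_context_parallel_size)
```

With the buggy helper, the draft config believes it holds the whole cache while the target shards it four ways, so the two disagree about what the cache contains.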

  • Performance improved from ~15 tok/s to ~24 tok/s at 128K context using DCP4 and MTP3/MTP4.
  • MTP acceptance rates per position reached 0.90, 0.79, and 0.67 for the first three speculative tokens.
  • The fix involved adding a missing configuration line to mirror upstream logic and rebasing onto a newer vLLM branch.
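
As a rough reading of the acceptance figures above, the sketch below estimates the mean number of tokens emitted per verification step. It assumes each reported rate is the fraction of draft steps in which that position was accepted (so later positions already account for earlier rejections) and that every step also yields one token from the target model. The source does not state how the rates were measured, so treat the result as an estimate; the function name is illustrative.

```python
def expected_tokens_per_step(per_position_rates):
    """Mean tokens emitted per verification step.

    Assumes per_position_rates[i] is the fraction of steps in which
    draft position i was accepted, plus one token that the target
    model always contributes.
    """
    return 1.0 + sum(per_position_rates)


# Per-position acceptance rates quoted in the summary.
rates = [0.90, 0.79, 0.67]
mean_len = expected_tokens_per_step(rates)
print(f"estimated tokens per step: {mean_len:.2f}")
```

Under these assumptions the estimate is about 3.36 tokens per step. This is not a direct throughput multiplier, since drafting and verification add their own cost.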

The fix removes the previous trade-off between context length and speed: users can now run the full 128K context at high throughput on this hardware configuration.