GLM-5.2 NVFP4 on four DGX Sparks — the MTP mystery is solved, and it's now ~24 tok/s at 128K context

A follow-up investigation into running GLM-5.2 NVFP4 on four DGX Spark nodes resolves the earlier performance bottleneck, in which MTP speculative decoding could not reach high acceptance rates at 128K context.

The root cause was a bug in vLLM's `SpeculativeConfig.create_draft_parallel_config()`, which failed to copy `decode_context_parallel_size` into the draft model's parallel config, so the draft layers ignored DCP sharding. As a result, their attention treated each rank's local cache fragment as if it were the full cache, and acceptance rates collapsed for MTP2 and MTP3.
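The failure mode described above can be sketched in a few lines. This is an illustration only, not vLLM's actual code: the `ParallelConfig` dataclass, the two helper functions, and the `tensor_parallel_size=4` value are hypothetical stand-ins. Only the field name `decode_context_parallel_size` comes from the write-up.

```python
from dataclasses import dataclass


@dataclass
class ParallelConfig:
    # Hypothetical stand-in for a parallel config; not vLLM's real class.
    tensor_parallel_size: int = 1
    decode_context_parallel_size: int = 1  # 1 means "no DCP sharding"


def create_draft_config_buggy(target: ParallelConfig) -> ParallelConfig:
    # Copies some fields but forgets the DCP size, so the draft config
    # silently falls back to the default of 1.
    return ParallelConfig(tensor_parallel_size=target.tensor_parallel_size)


def create_draft_config_fixed(target: ParallelConfig) -> ParallelConfig:
    # Same as above, plus the one missing field.
    return ParallelConfig(
        tensor_parallel_size=target.tensor_parallel_size,
        decode_context_parallel_size=target.decode_context_parallel_size,
    )


# Illustrative target config: DCP4 as in the post, TP size assumed.
target = ParallelConfig(tensor_parallel_size=4, decode_context_parallel_size=4)
buggy = create_draft_config_buggy(target)
fixed = create_draft_config_fixed(target)

print(buggy.decode_context_parallel_size)  # 1
print(fixed.decode_context_parallel_size)  # 4
```

With the buggy copy, the draft side believes the cache is unsharded while the target side shards it four ways, which matches the symptom of draft attention reading a local fragment as the whole context.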

Throughput improved from ~15 tok/s to ~24 tok/s (about 1.6x) at 128K context using DCP4 and MTP3/MTP4.
MTP acceptance rates per position reached 0.90, 0.79, and 0.67 for the first three speculative tokens.
The fix adds the missing line so that the draft parallel config also copies `decode_context_parallel_size`, mirroring upstream logic, and rebases the work onto a newer vLLM branch.
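To put the per-position acceptance rates in perspective, here is a rough back-of-the-envelope calculation. It assumes the three figures are unconditional rates (the fraction of draft steps in which position i was accepted), which is an assumption about how the numbers were reported, not something the post states. Under that assumption, the mean number of tokens emitted per target step is one (the token the target always produces) plus the sum of the rates.

```python
# Per-position acceptance rates quoted in the post.
rates = [0.90, 0.79, 0.67]

# Assumption: rates are unconditional, so expected accepted draft tokens
# per step is their sum; add 1 for the token the target model always emits.
mean_tokens_per_step = 1 + sum(rates)
print(round(mean_tokens_per_step, 2))  # 3.36

# Under the same assumption, the conditional acceptance of each position
# given that the previous position was accepted is the ratio of
# neighbouring rates (the first position has no predecessor).
conditional = [rates[0]] + [b / a for a, b in zip(rates, rates[1:])]
print([round(c, 3) for c in conditional])
```

If the figures were instead conditional rates, the expected length would be computed from cumulative products and would come out lower, so the 3.36 figure should be read as an estimate under the stated assumption.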

This resolution eliminates the previous trade-off between context length and speed, allowing users to run full 128K context with high throughput on this hardware configuration.