A follow-up investigation into running GLM-5.2 NVFP4 on four DGX Spark nodes resolves an earlier performance bottleneck: at 128K context, speculative-decoding (MTP) acceptance rates stayed too low to be useful.
The root cause was a bug in vLLM's `SpeculativeConfig.create_draft_parallel_config()`, which failed to copy `decode_context_parallel_size` into the draft model's parallel config, so the draft layers ignored DCP (decode context parallel) sharding. Their attention then treated each rank's local cache fragment as if it were the full cache, and acceptance rates collapsed for MTP2 and MTP3.
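The failure mode described above can be illustrated with a toy model. This is a minimal sketch, not vLLM's code: `ToyParallelConfig` and both `make_draft_config_*` helpers are hypothetical names, and the `tensor_parallel_size` value is purely illustrative. Only the field name `decode_context_parallel_size` and the DCP size of 4 come from the write-up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyParallelConfig:
    """Hypothetical stand-in for a parallel config; not vLLM's class."""
    tensor_parallel_size: int = 1
    # A default of 1 means "the cache is not sharded across ranks".
    decode_context_parallel_size: int = 1


def make_draft_config_buggy(target: ToyParallelConfig) -> ToyParallelConfig:
    # Copies some fields but forgets decode_context_parallel_size,
    # so the draft config silently falls back to the default of 1.
    return ToyParallelConfig(
        tensor_parallel_size=target.tensor_parallel_size,
    )


def make_draft_config_fixed(target: ToyParallelConfig) -> ToyParallelConfig:
    # The one-line fix: carry the DCP size over to the draft config.
    return ToyParallelConfig(
        tensor_parallel_size=target.tensor_parallel_size,
        decode_context_parallel_size=target.decode_context_parallel_size,
    )


# Illustrative target settings; DCP=4 matches the DCP4 setup in the text.
target = ToyParallelConfig(tensor_parallel_size=4,
                           decode_context_parallel_size=4)
buggy = make_draft_config_buggy(target)
fixed = make_draft_config_fixed(target)

print("buggy draft DCP:", buggy.decode_context_parallel_size)
print("fixed draft DCP:", fixed.decode_context_parallel_size)
```

In this toy, the buggy draft config reports a DCP size of 1 while the target uses 4. That mismatch is the kind that would let a draft layer treat its local shard as the whole cache.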
- Performance improved from ~15 tok/s to ~24 tok/s at 128K context using DCP4 and MTP3/MTP4.
- MTP acceptance rates per position reached 0.90, 0.79, and 0.67 for the first three speculative tokens.
- The fix involved adding a missing configuration line to mirror upstream logic and rebasing onto a newer vLLM branch.
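To put the numbers above in perspective, the following sketch estimates how many tokens each decoding step yields from the three reported per-position acceptance rates. The source does not say whether those rates are unconditional (the fraction of steps in which position i was accepted) or conditional on the previous position being accepted, so both readings are computed. The function names are illustrative, and the estimate covers only the three positions reported, not a fourth MTP4 position.

```python
# Per-position acceptance rates and throughput figures quoted in the text.
rates = [0.90, 0.79, 0.67]
tok_s_before, tok_s_after = 15.0, 24.0


def mean_len_if_unconditional(per_pos):
    """Assumes each rate is P(position i accepted) over all draft steps.

    Expected tokens per step = 1 verified token + sum of the rates.
    """
    return 1.0 + sum(per_pos)


def mean_len_if_conditional(per_pos):
    """Assumes each rate is P(position i accepted | position i-1 accepted).

    The chance of reaching position i is the running product of the rates.
    """
    total, reach = 1.0, 1.0
    for r in per_pos:
        reach *= r
        total += reach
    return total


mean_len_unconditional = mean_len_if_unconditional(rates)
mean_len_conditional = mean_len_if_conditional(rates)
speedup = tok_s_after / tok_s_before

print(f"tokens/step (unconditional reading): {mean_len_unconditional:.2f}")
print(f"tokens/step (conditional reading):   {mean_len_conditional:.2f}")
print(f"throughput ratio after/before:       {speedup:.2f}x")
```

Under the unconditional reading the estimate is about 3.36 tokens per step, and under the conditional reading about 3.09. The approximate throughput figures give a ratio of roughly 1.6x.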
With the fix, the earlier trade-off between context length and speed no longer applies: the full 128K context runs at roughly 24 tok/s on this four-node hardware configuration.