A user demonstrates a disaggregated inference pipeline that uses a DGX Spark for prefill and a Strix Halo box for token generation, achieving significant speedups on long-context workloads. Offloading the compute-intensive prompt processing to the DGX, while the Strix's memory bandwidth handles decoding, avoids the prompt-processing slowdown the Strix shows at long contexts when running alone.
- The pipeline runs Qwen 3.5 122B (MTP) GGUF across both devices using llama.cpp and EXO.
- Token generation speeds are nearly identical between the two machines, with only a 13-15% advantage for the DGX Spark.
- Disaggregated prefilling yields speedups ranging from 2.8x to 4.4x compared to running end-to-end on the Strix Halo.
- The Strix's standalone prompt processing drops from 275 t/s at short contexts to 140 t/s at 127k tokens, whereas the DGX handles this load efficiently.
This approach lets users put high-performance prefill hardware to work without spending its compute budget on token generation, which eases the prefill bottleneck in long-context agentic loops.
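To put the figures above in wall-clock terms, here is a minimal Python sketch of the arithmetic. It assumes the quoted 140 t/s is an average rate over the whole 127k-token prompt, which the summary does not confirm. The function names, the `transfer_s` parameter, and the numbers in the second example are hypothetical illustrations, not measurements from the post.

```python
def prefill_seconds(prompt_tokens: int, tokens_per_second: float) -> float:
    """Time to process a prompt, assuming the rate is an average over the prompt."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return prompt_tokens / tokens_per_second


def end_to_end_speedup(
    prefill_solo_s: float,
    prefill_offloaded_s: float,
    decode_s: float,
    transfer_s: float = 0.0,
) -> float:
    """Ratio of solo time to disaggregated time for one request.

    Assumes decoding runs on the same machine in both cases, so decode_s is
    unchanged, and that offloading adds a fixed transfer cost (hypothetical).
    """
    solo = prefill_solo_s + decode_s
    disaggregated = prefill_offloaded_s + transfer_s + decode_s
    return solo / disaggregated


# Strix Halo alone at the long-context rate quoted in the summary.
strix_127k = prefill_seconds(127_000, 140.0)
print(f"{strix_127k:.0f} s (~{strix_127k / 60:.1f} min)")  # 907 s (~15.1 min)

# Hypothetical inputs, only to show how the ratio behaves.
print(end_to_end_speedup(prefill_solo_s=100.0, prefill_offloaded_s=20.0, decode_s=5.0))
```

Under these assumptions, the Strix alone needs about 15 minutes to prefill a 127k-token prompt before it emits its first token. Because decode time is roughly the same on either path, the end-to-end speedup is largest when prefill dominates the request.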