A Reddit user is considering buying four ASUS Ascent GX10 systems in order to be ready to run a future open-source "fable 5" model, citing benchmarks from other users who tested GLM5.2 on similar hardware.

  • Benchmarks show GLM5.2 reaching 400-500 tokens per second for prompt processing and roughly 15 tokens per second for output at a 128k context length on four DGX Sparks or ASUS Ascent GX10s.
  • The setup draws around 1000 W of power, which the user considers manageable.
  • Quantization is suggested as a way to make the setup more usable at these inference speeds.
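
To put the quoted figures in wall-clock terms, here is a rough back-of-the-envelope sketch in Python. It uses only the numbers above (400-500 tokens/s prompt processing, about 15 tokens/s output, about 1000 W). The assumptions are mine, not the poster's: "128k" is taken as 128 × 1024 tokens, the throughput is treated as constant over the whole prompt, the 1000 W draw is treated as constant during generation, and the 1000-token reply length is purely illustrative.

```python
# Back-of-the-envelope latency and energy estimates from the quoted benchmarks.
# Assumptions (not from the original post): 128k = 128 * 1024 tokens,
# constant throughput, constant 1000 W draw, illustrative 1000-token reply.

CONTEXT_TOKENS = 128 * 1024      # assumed meaning of "128k"
PREFILL_TPS_RANGE = (400, 500)   # prompt processing, tokens per second
DECODE_TPS = 15                  # output generation, tokens per second
POWER_WATTS = 1000               # approximate draw of the four-node setup


def prefill_seconds(prompt_tokens: int, prefill_tps: float) -> float:
    """Time to process the prompt before the first output token appears."""
    return prompt_tokens / prefill_tps


def decode_seconds(output_tokens: int, decode_tps: float) -> float:
    """Time to generate a reply of the given length."""
    return output_tokens / decode_tps


def joules_per_token(power_watts: float, tps: float) -> float:
    """Energy per token, since watts are joules per second."""
    return power_watts / tps


if __name__ == "__main__":
    for tps in PREFILL_TPS_RANGE:
        minutes = prefill_seconds(CONTEXT_TOKENS, tps) / 60
        print(f"Full-context prefill at {tps} tok/s: {minutes:.1f} min")
    print(f"1000-token reply: {decode_seconds(1000, DECODE_TPS):.0f} s")
    print(f"Energy per output token: {joules_per_token(POWER_WATTS, DECODE_TPS):.0f} J")
```

Under these assumptions, a completely full context takes roughly 4.4 to 5.5 minutes to process before the first output token, and a 1000-token reply takes a little over a minute to generate. That latency is the practical reason the thread points toward quantization.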