A user shares detailed performance metrics for running the Qwen3.6 27B model on an RTX 5090, AMD 9800X3D, and 64GB RAM system using llama.cpp.

  • Tuning involved q8 KV cache, 192k context, MTP draft=10, spec-draft-p-min=0.5, and batch/ubatch 512.
  • Analysis of 6,454 samples over a mixed agentic coding session showed a mean throughput of 140.7 tok/s and median of 134.9 tok/s.
  • Peak performance reached the 120-130 tok/s bucket with a long tail extending up to 233 tok/s.
  • The author notes that hybrid attention/SWA cache handling in llama.cpp is not yet perfect for this model, causing prompt reprocessing warnings.

The post highlights that average numbers can hide performance variations, providing a real distribution of speeds rather than just a headline figure.