The llama.cpp release b9866 enables topk-moe fusion for models with 288 experts, such as Step-3.7-Flash, which previously fell back to an unfused routing chain. The change adds the missing template instantiation for 288 experts and accepts that count in the eligibility check, which it passes because 288 is a multiple of the warp size (288 = 9 × 32).
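
The pairing of an eligibility check with a per-count template instantiation can be sketched roughly as follows. This is a minimal illustration, not the llama.cpp source: the function names, the list of supported expert counts other than 288, and the warp size of 32 are all assumptions made for the example.

```cpp
// Hypothetical sketch; none of these names come from llama.cpp.
constexpr int WARP_SIZE = 32;  // assumed warp width

// Stand-in for a fused routing kernel templated on the expert count.
// A real backend would launch a GPU kernel here; this stub only returns
// how many experts each lane of a warp would own.
template <int n_experts>
int launch_fused_topk_moe() {
    static_assert(n_experts % WARP_SIZE == 0,
                  "each lane must own a whole number of experts");
    return n_experts / WARP_SIZE;
}

// Eligibility check: fusion is only attempted for expert counts that
// have a matching template instantiation in the dispatch below.
bool can_fuse_topk_moe(int n_experts) {
    switch (n_experts) {
        case 32: case 64: case 128: case 256:  // assumed pre-existing counts
        case 288:                              // newly accepted: 9 * 32
            return true;
        default:
            return false;
    }
}

// Maps the runtime expert count onto a compile-time instantiation.
// Returns experts per lane, or -1 to signal "use the unfused chain".
int dispatch_topk_moe(int n_experts) {
    switch (n_experts) {
        case 32:  return launch_fused_topk_moe<32>();
        case 64:  return launch_fused_topk_moe<64>();
        case 128: return launch_fused_topk_moe<128>();
        case 256: return launch_fused_topk_moe<256>();
        case 288: return launch_fused_topk_moe<288>();
        default:  return -1;
    }
}
```

In this sketch a count such as 288 needs both pieces: without the `case 288` in the check the model never reaches the fused path, and without the instantiation there is nothing to dispatch to.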

  • Measured on gfx1151 with Step-3.7-Flash IQ4_XS, decode throughput (tg128) improved by 2.4% at shallow context.
  • Prompt processing (pp4096) remains unchanged as the fusion only affects decode routing.
  • The performance gain fades with depth; by a context depth of 30k tokens, each decode step is attention-bound over the KV cache.
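
Why a fixed routing saving shrinks in relative terms as context grows can be shown with a toy cost model. The sketch below is an illustration under stated assumptions only: the linear attention term and every parameter are hypothetical and do not come from the b9866 benchmarks.

```cpp
// Toy model with hypothetical parameters; not derived from any measurement.
// Per-token decode time = routing cost + other fixed work
//                         + attention cost that grows with context depth.
double decode_time_us(double routing_us, double other_us,
                      double attn_us_per_ktok, double depth_ktok) {
    return routing_us + other_us + attn_us_per_ktok * depth_ktok;
}

// Relative throughput gain from removing `saved_us` of routing cost
// at a given context depth (in thousands of tokens).
double fusion_gain(double routing_us, double saved_us, double other_us,
                   double attn_us_per_ktok, double depth_ktok) {
    const double before =
        decode_time_us(routing_us, other_us, attn_us_per_ktok, depth_ktok);
    const double after =
        decode_time_us(routing_us - saved_us, other_us, attn_us_per_ktok, depth_ktok);
    return before / after - 1.0;
}
```

Calling `fusion_gain` with the same saving at depth 0 and at depth 30 returns a smaller value for the deeper context, because the saved time stays constant while the attention term in the denominator keeps growing.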

This optimization speeds up the decoding phase for Mixture-of-Experts models with 288 experts on the CUDA backend; prompt processing is unaffected.