A user is running the Qwen3.5 122B model with llama-server on a system equipped with an RTX 5090 GPU and 64GB of RAM. The reported generation speed starts at approximately 6 tokens per second (tps) and gradually rises to around 20 tps as generation proceeds.

  • Hardware configuration: NVIDIA RTX 5090 with 32GB VRAM and 64GB system RAM.
  • Model variant: Qwen3.5-122B-A10B quantized as Q5_K_S.
  • Performance metrics: Initial throughput of ~6 tps rising to ~20 tps over the course of the generation.
  • Inference settings: llama-server with flash attention enabled, 16 threads, and a context length of 100,000 tokens.
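
For context on the figures above, here is a rough back-of-the-envelope estimate of how the quantized weights compare with the available memory. It is only a sketch: the 5.5 bits-per-weight average for Q5_K_S is an assumed figure, the "A10B" suffix is assumed to mean roughly 10B active parameters per token, and KV cache, compute buffers, and OS overhead are ignored.

```python
# Rough weight-memory estimate for the setup described above.
# Assumptions (not from the original post): the bits-per-weight average
# and the reading of "A10B" as ~10B active parameters per token.

TOTAL_PARAMS = 122e9     # from the model name (122B)
ACTIVE_PARAMS = 10e9     # assumed meaning of the "A10B" suffix
BITS_PER_WEIGHT = 5.5    # assumed average for Q5_K_S
VRAM_GB = 32             # RTX 5090
RAM_GB = 64              # system RAM


def weights_gb(params: float, bits_per_weight: float) -> float:
    """Approximate size in GB of `params` weights at the given bit width."""
    return params * bits_per_weight / 8 / 1e9


model_gb = weights_gb(TOTAL_PARAMS, BITS_PER_WEIGHT)
active_gb = weights_gb(ACTIVE_PARAMS, BITS_PER_WEIGHT)
spill_gb = max(0.0, model_gb - VRAM_GB)  # weights that cannot sit in VRAM

print(f"Estimated weights:         {model_gb:.1f} GB")
print(f"Active weights per token:  {active_gb:.1f} GB")
print(f"Weights outside VRAM:      {spill_gb:.1f} GB (system RAM holds {RAM_GB} GB)")
print(f"Fits in VRAM alone:        {model_gb <= VRAM_GB}")
print(f"Fits in VRAM + RAM:        {model_gb <= VRAM_GB + RAM_GB}")
```

Under these assumptions the weights come to roughly 84 GB, so most of the model has to live in system RAM rather than on the GPU, and the 100,000-token context adds further memory on top of that.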

The user is seeking advice on how to further optimize this setup to achieve higher token generation speeds.