A user is attempting to run the Qwen3.5 122B model using llama-server on a system equipped with an RTX 5090 GPU and 64GB of RAM. The reported inference speed starts at approximately 6 tokens per second (tps) and gradually increases to around 20 tps during generation.
- Hardware configuration: NVIDIA RTX 5090 with 32GB VRAM and 64GB system RAM.
- Model variant: Qwen3.5-122B-A10B quantized as Q5_K_S.
- Performance metrics: Throughput starts at ~6 tps and climbs to ~20 tps as generation proceeds.
- Inference settings: llama-server with flash attention enabled, 16 threads, and a context length of 100,000 tokens.
The user is seeking advice on how to further optimize this setup to achieve higher token generation speeds.
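
To put the numbers above in perspective, here is a rough back-of-the-envelope sketch of the weight footprint compared with the available memory. It assumes Q5_K_S averages roughly 5.5 bits per weight (an approximation, since the real figure varies by tensor mix) and reads "A10B" as about 10B active parameters per token. It ignores the KV cache for the 100,000-token context and any runtime overhead, so treat the output as an estimate only.

```python
# Rough estimate of quantized weight size vs. available memory.
# All constants below are assumptions for illustration, not measured values.

def quantized_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size in GB (10^9 bytes) of n_params weights at a given bit width."""
    return n_params * bits_per_weight / 8 / 1e9

TOTAL_PARAMS = 122e9     # total parameters, from the model name
ACTIVE_PARAMS = 10e9     # active parameters per token, read from "A10B"
BITS_PER_WEIGHT = 5.5    # assumed average for Q5_K_S
VRAM_GB = 32             # RTX 5090
RAM_GB = 64              # system RAM

model_gb = quantized_size_gb(TOTAL_PARAMS, BITS_PER_WEIGHT)
active_gb = quantized_size_gb(ACTIVE_PARAMS, BITS_PER_WEIGHT)
spill_gb = max(0.0, model_gb - VRAM_GB)  # weights that cannot fit in VRAM

print(f"Estimated weights:        {model_gb:.1f} GB")
print(f"Active weights per token: {active_gb:.1f} GB")
print(f"Must live outside VRAM:   {spill_gb:.1f} GB")
print(f"VRAM + RAM combined:      {VRAM_GB + RAM_GB} GB")
```

Under these assumptions the weights alone come to roughly 84 GB, so well over half of the model has to sit in system RAM rather than on the GPU, and the combined 96 GB leaves limited headroom for the KV cache and the operating system. That suggests memory placement, not raw GPU compute, is the first thing to look at when tuning this setup.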