A user investigates optimizing the Qwen3.6-27B model on a dual AMD Radeon R9700 setup using llama.cpp, comparing performance between Vulkan and ROCm backends.
- ROCm achieves roughly double the prefill throughput (1355 tokens/s) by saturating both GPUs, whereas Vulkan uses only one GPU at a time (682.7 tokens/s).
- Token generation is slightly faster with Vulkan (24.55 tokens/s) than with ROCm (22.3 tokens/s), and ROCm leaves the second GPU partially idle during this phase.
- Setting `split-mode = tensor` evens out GPU usage but lowers performance, possibly because of PCIe bandwidth limitations.
The author seeks community advice on further tuning parameters or alternative engines like vLLM to maximize token generation throughput.
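
To put the reported figures side by side, the short sketch below computes the relative speed of each backend per phase. It uses only the four throughput numbers quoted above; the dictionary layout and the `ratio` helper are illustrative names, not part of llama.cpp or of the original post.

```python
# Throughput figures quoted in the summary above (tokens per second).
reported = {
    "prefill": {"ROCm": 1355.0, "Vulkan": 682.7},
    "generation": {"ROCm": 22.3, "Vulkan": 24.55},
}


def ratio(phase, fast, slow):
    """Return how many times faster `fast` is than `slow` in one phase.

    Hypothetical helper for this comparison only.
    """
    return reported[phase][fast] / reported[phase][slow]


prefill_ratio = ratio("prefill", "ROCm", "Vulkan")
generation_ratio = ratio("generation", "Vulkan", "ROCm")

print(f"Prefill: ROCm is {prefill_ratio:.2f}x Vulkan")
print(f"Generation: Vulkan is {generation_ratio:.2f}x ROCm")
```

The prefill ratio comes out just under 2x, which fits the observation that ROCm keeps both GPUs busy while Vulkan uses one at a time. The generation gap is only about 10% in Vulkan's favor, so the backend choice matters far more for prefill than for generation in these measurements.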