A user upgraded a budget PC with two RTX 3090s and an Intel Arc A770 to test multi-GPU inference performance using llama.cpp. The primary finding is that the Vulkan backend causes excessive memory overhead compared to CUDA, making it unsuitable for mixed-vendor setups.
- The system consists of 2x Zotac RTX 3090 (24 GB), 1x Intel Arc A770 (16 GB), an AMD Ryzen 5 1600X, and 48 GB DDR4 RAM.
- Using CUDA with two RTX 3090s allows running Qwen 3.6 27b Q8_K_XL bf16 cache with a 170k context at 30 tokens/s.
- Vulkan adds approximately 5 GB of memory overhead per 24 GB card, leaving little space for context in mixed setups.
- Running the same model on three GPUs via Vulkan resulted in only 3 tokens/s and required 21.7 GB VRAM before loading the KV cache.
The author concludes that users should stick to a single GPU vendor and use their native backend rather than attempting multi-GPU inference with Vulkan.