media r/LocalLLaMA · 9h ago

Reddit user proposes combining RTX 5080 and 4060 for local LLM inference

A Reddit user in the r/LocalLLaMA community is considering a hardware upgrade to improve inference speed and capacity for Qwen models, pairing a planned RTX 5080 (16GB) with their existing RTX 4060 (8GB). The user is targeting roughly 20-40 tokens per second on Qwen 27B models, using the combined 24GB of VRAM through tensor or layer splitting in llama.cpp or vLLM. They are weighing this asymmetric dual-GPU setup against alternatives such as the AMD Radeon AI PRO R9700 or the RX 7900 XTX, citing benchmark data that suggests the AMD cards offer limited performance gains relative to their cost.
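
To illustrate what layer splitting across an asymmetric pair involves, here is a minimal Python sketch that divides a model's layers in proportion to each card's VRAM (16GB for the RTX 5080, 8GB for the RTX 4060). The function name `split_layers` and the 64-layer count are hypothetical, chosen only for illustration; they do not come from the post or from any library. The sketch also ignores KV cache and activation memory, which a real setup would need to budget for.

```python
def split_layers(n_layers, vram_gb):
    """Assign whole layers to GPUs in proportion to their VRAM.

    Illustrative only: real runtimes also reserve memory for the
    KV cache, activations, and the output layer.
    """
    total = sum(vram_gb)
    # Ideal (fractional) number of layers per device
    shares = [n_layers * v / total for v in vram_gb]
    counts = [int(s) for s in shares]
    # Hand leftover layers to the devices with the largest remainders
    leftover = n_layers - sum(counts)
    order = sorted(
        range(len(vram_gb)),
        key=lambda i: shares[i] - counts[i],
        reverse=True,
    )
    for i in order[:leftover]:
        counts[i] += 1
    return counts


# Hypothetical 64-layer model on a 16GB + 8GB pair
print(split_layers(64, [16, 8]))  # [43, 21]
```

In llama.cpp, the same proportional idea is expressed with the `--tensor-split` option, which takes a comma-separated list of per-GPU proportions such as `16,8`. In a layer split, each token still passes through both cards in sequence, so the slower card's share of the layers is likely to limit overall tokens per second.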