For Retrieval-Augmented Generation (RAG), prefill throughput, not decode speed, is the primary performance bottleneck. RAG queries inject thousands of tokens of retrieved context into every prompt, and the model must process all of that context before it can generate the first token.
- On unified memory systems like Strix Halo, prefill throughput lags significantly behind discrete GPUs, even though decode speed is adequate for Mixture of Experts (MoE) models.
- While a single 24 GB discrete card processes this context in seconds, unified memory setups can pause for 20 to 60 seconds before generating the first token.
- Budget-constrained buyers should choose hardware with a free PCIe slot, so that a discrete card can be added later specifically to offload prefill.
This distinction matters because interactive RAG workflows require rapid context processing, which unified memory architectures currently struggle to provide compared to dedicated graphics cards.
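To make the 20 to 60 second figure concrete, here is a rough back-of-the-envelope sketch. It assumes the pause before the first token is simply prompt size divided by prefill throughput, which ignores prompt caching, batching, and model load time. The 8,000-token prompt size and both function names are illustrative assumptions, not measurements from any specific hardware.

```python
def prefill_seconds(prompt_tokens: int, prefill_tokens_per_second: float) -> float:
    """Estimate the pause before the first token: prompt size / prefill throughput."""
    if prefill_tokens_per_second <= 0:
        raise ValueError("prefill throughput must be positive")
    return prompt_tokens / prefill_tokens_per_second


def implied_throughput(prompt_tokens: int, pause_seconds: float) -> float:
    """Invert the estimate: the prefill throughput implied by an observed pause."""
    if pause_seconds <= 0:
        raise ValueError("pause must be positive")
    return prompt_tokens / pause_seconds


# Hypothetical RAG prompt size, chosen only for illustration.
PROMPT_TOKENS = 8_000

if __name__ == "__main__":
    # The 20 s and 60 s pauses are the range quoted in the text above.
    for pause in (20, 60):
        rate = implied_throughput(PROMPT_TOKENS, pause)
        print(f"{pause} s pause on {PROMPT_TOKENS} tokens implies ~{rate:.0f} tokens/s prefill")
```

Under these assumptions, the same arithmetic can be run in the other direction: plug in a measured prefill rate for a candidate machine and a typical prompt size from your own retrieval pipeline to see whether the resulting pause is acceptable for interactive use.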