A benchmark of 13 local LLMs on an RX 7900 XT shows that in agentic workflows with 65K to 128K tokens of context, the prefill phase accounts for 94–99% of wall-clock time, which makes token generation speed largely irrelevant.
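The arithmetic behind the 94–99% figure can be sketched in a few lines. This is a minimal illustration, not the benchmark's own method: the prefill rate is the 923 tokens/sec reported for Trinity-Mini, while the reply length (300 tokens) and generation rate (60 tokens/sec) are hypothetical values chosen only to show the shape of the calculation.

```python
def prefill_share(prompt_tokens, prefill_tps, gen_tokens, gen_tps):
    """Fraction of wall-clock time spent in prefill for one request."""
    prefill_s = prompt_tokens / prefill_tps   # time to process the prompt
    gen_s = gen_tokens / gen_tps              # time to generate the reply
    return prefill_s / (prefill_s + gen_s)


# 131,072-token prompt at the reported 923 tokens/sec prefill rate.
# The 300-token reply and 60 tokens/sec generation rate are assumptions.
share = prefill_share(prompt_tokens=131_072, prefill_tps=923,
                      gen_tokens=300, gen_tps=60)
print(f"prefill share of wall-clock time: {share:.1%}")
```

Under these assumed numbers prefill takes about 142 seconds against 5 seconds of generation, so doubling the generation rate would save only a couple of seconds per turn.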

  • The test used llama.cpp build 9860 with the Vulkan backend and covered dense, MoE, Mamba2 hybrid, and MLA MoE models ranging from 5GB to 18GB.
  • Trinity-Mini (MoE 3B/26B) achieved the highest prefill speed, 923 tokens/sec at 131K context, while GLM-4.7-Flash crashed above 16K due to MLA constraints.
  • Devstral-24B could not complete the 131K test because its KV cache requirements exceeded the GPU's VRAM capacity.
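A rough estimate shows why a KV cache can outgrow VRAM at this context length. The sketch below uses the standard sizing formula for a transformer with grouped-query attention and an unquantized 16-bit cache. The layer count, KV head count, and head dimension are illustrative assumptions for a 24B-class dense model, not values taken from the benchmark, and the formula does not apply to MLA or Mamba2 hybrid layers.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Size of the K and V caches for a standard attention transformer.

    The leading 2 counts the separate key and value tensors;
    bytes_per_elem=2 corresponds to a 16-bit (f16) cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


# Assumed model shape (hypothetical): 40 layers, 8 KV heads, head_dim 128.
size = kv_cache_bytes(n_layers=40, n_kv_heads=8, head_dim=128, n_ctx=131_072)
print(f"KV cache at 131,072 tokens: {size / 1024**3:.1f} GiB")
```

With these assumed dimensions the cache alone comes to about 20 GiB, which already matches the RX 7900 XT's 20 GB of VRAM before any model weights are loaded. A quantized cache lowers `bytes_per_elem` and shrinks the total proportionally.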

The findings suggest that prefill performance and KV cache size matter more than parameter count or generation speed for long-context agentic tasks.