A user reports a sharp drop in inference speed after switching from GPT-OSS 20B Q4 to Gemma 4 12B Q8 in llama.cpp: throughput fell from roughly 70 tokens per second to about 10. The slowdown persists with a Q5 variant of the model, and disabling the thinking feature gained only about two tokens per second.
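To put the reported figures in perspective, the following minimal sketch converts them into a slowdown factor and a per-token latency. The numbers are the approximate ones quoted above; the 12 tokens per second figure assumes the "two additional tokens" were gained on top of the 10 tokens per second baseline, and the helper names are illustrative only.

```python
def slowdown(baseline_tps: float, observed_tps: float) -> float:
    """Return how many times slower the observed throughput is."""
    return baseline_tps / observed_tps


def ms_per_token(tps: float) -> float:
    """Convert tokens per second into milliseconds per generated token."""
    return 1000.0 / tps


# Approximate figures taken from the report.
gpt_oss_tps = 70.0
gemma_tps = 10.0
# Assumption: disabling thinking added about 2 tok/s to the Gemma figure.
gemma_no_thinking_tps = gemma_tps + 2.0

for label, tps in [
    ("GPT-OSS 20B Q4", gpt_oss_tps),
    ("Gemma (thinking on)", gemma_tps),
    ("Gemma (thinking off)", gemma_no_thinking_tps),
]:
    print(
        f"{label}: {tps:.0f} tok/s, "
        f"{ms_per_token(tps):.1f} ms/token, "
        f"{slowdown(gpt_oss_tps, tps):.1f}x slower than baseline"
    )
```

On these figures the regression is roughly sevenfold, and disabling thinking narrows it only to a little under sixfold.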

  • Hardware: NVIDIA RTX 4000 SFF Ada Generation (20GB VRAM) with a 13th Gen Intel Core i5-13500 CPU.
  • Model: Gemma 4 12B IT loaded as GGUF (Q5_K_XL), consuming 10GB of GPU memory.
  • Configuration: llama-server running with `--threads 16`, `--ctx-size 8192`, and `--n-gpu-layers 99`.
  • Warnings: Logs show deprecated `enable_thinking` kwargs, control token type mismatches, and a context size (8192) much smaller than the model's training context length (262144).
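
For anyone trying to reproduce the setup, the configuration listed above might correspond to a launch command along these lines. This is a sketch, not the user's actual invocation: the three flags and their values come from the report, while the model path is a hypothetical placeholder and the `--model` option is assumed because the report does not say how the model file was passed.

```python
import shlex

# Hypothetical placeholder: the report does not give the GGUF file path.
MODEL_PATH = "models/gemma-Q5_K_XL.gguf"


def build_llama_server_cmd(
    model_path: str,
    threads: int = 16,
    ctx_size: int = 8192,
    n_gpu_layers: int = 99,
) -> list[str]:
    """Assemble a llama-server argument list from the reported settings."""
    return [
        "llama-server",
        "--model", model_path,          # assumed; not stated in the report
        "--threads", str(threads),      # reported value: 16
        "--ctx-size", str(ctx_size),    # reported value: 8192
        "--n-gpu-layers", str(n_gpu_layers),  # reported value: 99
    ]


cmd = build_llama_server_cmd(MODEL_PATH)
# Print a shell-safe command line that can be copied into a terminal.
print(shlex.join(cmd))
```

Keeping the arguments in a list makes it easy to vary one setting at a time (for example the thread count) while comparing throughput between runs.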

The user is seeking troubleshooting advice for this performance regression in their llama.cpp service setup.