A user demonstrated that the Gemma-4-31B-it model can run with an 80,000-token context window on an RTX 5090 GPU using llama.cpp, more than double the typical 35k limit.
The configuration runs under Docker and depends on a few specific settings: the environment variable `GGML_CUDA_NO_PINNED=1` and the flags `--backend-sampling --parallel 1`. The setup also enables `--flash-attn on` and sets the context size explicitly via `--ctx-size 80000`.
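As a rough illustration of how these settings could fit together, the sketch below assembles a `docker run` command line in Python and prints it. Only the environment variable and the four llama.cpp options come from the report. The image tag `ghcr.io/ggml-org/llama.cpp:server-cuda`, the `/models` volume mount, the published port, and the model filename are assumptions chosen for the example, not details from the original setup.

```python
import shlex

# Assumed image tag for the CUDA build of llama-server; the report does not name one.
DEFAULT_IMAGE = "ghcr.io/ggml-org/llama.cpp:server-cuda"


def build_llama_server_command(model_path, ctx_size=80000, image=DEFAULT_IMAGE):
    """Return a `docker run` argv list using the settings described in the report."""
    return [
        "docker", "run", "--rm",
        "--gpus", "all",                 # expose the GPU to the container
        "-e", "GGML_CUDA_NO_PINNED=1",   # environment variable from the report
        "-v", "/models:/models",         # hypothetical host directory holding the GGUF file
        "-p", "8080:8080",               # assumed port mapping
        image,
        "-m", model_path,
        "--ctx-size", str(ctx_size),     # context size from the report
        "--flash-attn", "on",
        "--backend-sampling",
        "--parallel", "1",               # a single server slot
    ]


# Hypothetical model filename; substitute the actual GGUF path.
command = build_llama_server_command("/models/gemma-4-31b-it.gguf")
print(shlex.join(command))
```

Building the command as a list keeps each option separate from its value, so the context size or image can be changed in one place before the printed command is pasted into a shell.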
This approach lets users push the context length of Gemma-4 models past the usual limit by reusing configuration tweaks previously noted for other architectures.