A user has successfully optimized the DeepSeek V4 Flash model to run on an NVIDIA GeForce RTX 5090 using a specific fork of llama.cpp. The configuration supports a 1 million token context window while retaining some VRAM headroom.
- Benchmark results show Token Generation (TG) throughput dropping from 22.7 to 21.3 tokens/second and Prompt Processing (PP) throughput decreasing from 1105 to 927 tokens/second.
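- The setup uses a Q2_K-quantized GGUF of the mixture-of-experts (MoE) model, runs without a unified KV cache, and sets n-cpu-moe to 37, which keeps the MoE expert weights of the first 37 layers in CPU memory.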
- The user reached a 1 million token context by using a micro-batch size (ub) of 512, which keeps the working set within the RTX 5090's 32 GB of VRAM.
- Optimization required a custom llama.cpp fork from GitHub user fairydreaming and specific CMake build flags for CUDA architecture 120.
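
The settings listed above can be sketched as two command lines, assembled here in Python. This is only an illustration under stated assumptions: the summary gives just the micro-batch size (512), the n-cpu-moe value (37), and CUDA architecture 120. The model path, the exact context token count, the choice of the llama-server binary, and the build directory are hypothetical, and the user's real invocation may differ.

```python
import shlex

# Hypothetical values, not taken from the summary.
MODEL_PATH = "models/deepseek-v4-flash-Q2_K.gguf"  # assumed file name and location
CTX_SIZE = 1_000_000  # "1 million"; the exact token count is an assumption


def cmake_configure_args(cuda_arch: int = 120) -> list[str]:
    """Configure step for a CUDA build targeting the given architecture."""
    return [
        "cmake", "-B", "build",
        "-DGGML_CUDA=ON",                           # enable the CUDA backend
        f"-DCMAKE_CUDA_ARCHITECTURES={cuda_arch}",  # 120 matches the RTX 5090
    ]


def runtime_args(model: str = MODEL_PATH, ctx: int = CTX_SIZE,
                 ubatch: int = 512, n_cpu_moe: int = 37) -> list[str]:
    """Runtime flags matching the settings named in the summary."""
    return [
        "./build/bin/llama-server",       # assumed binary; the source does not name one
        "-m", model,
        "-c", str(ctx),                   # context size in tokens
        "-ub", str(ubatch),               # micro-batch size
        "--n-cpu-moe", str(n_cpu_moe),    # expert weights of the first N layers stay on CPU
    ]


if __name__ == "__main__":
    print(shlex.join(cmake_configure_args()))
    print(shlex.join(runtime_args()))
```

Keeping the values as function parameters makes it easy to try a different micro-batch size or n-cpu-moe split when fitting the model into a different amount of VRAM.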
This configuration demonstrates that DeepSeek V4 Flash can operate with massive context windows on consumer hardware, with throughput roughly 6% (TG) and 16% (PP) below the higher of the two reported figures.
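
The reported throughput figures can be turned into relative drops with a few lines of arithmetic. The sketch below uses only the four numbers quoted in the benchmark bullet; the helper name relative_drop is introduced here for illustration.

```python
def relative_drop(before: float, after: float) -> float:
    """Percentage decrease going from `before` to `after`."""
    return (before - after) / before * 100.0


# Figures quoted in the summary, in tokens/second.
tg_drop = relative_drop(22.7, 21.3)   # token generation
pp_drop = relative_drop(1105, 927)    # prompt processing

print(f"TG: {tg_drop:.1f}% lower")  # prints "TG: 6.2% lower"
print(f"PP: {pp_drop:.1f}% lower")  # prints "PP: 16.1% lower"
```

Prompt processing loses proportionally more than token generation, which is worth keeping in mind when a workload is dominated by long prompts.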