A developer has implemented a CUDA kernel and wired the DSA (DeepSeek Sparse Attention) lightning indexer into llama.cpp, enabling local inference of DeepSeek V4 Flash with its full 1M-token context on consumer hardware such as the RTX 5090.
- The patch cuts compute buffer requirements from ~67 GiB to 3.2 GiB at 256K context (roughly a 21x reduction) and allows 1M context with only 3.75 GiB of VRAM.
- Prefill throughput rises about 4.7x, to ~263 tokens/s at 256K context from 56 tokens/s previously.
- Correctness was verified with needle-in-haystack tests at 10%, 50%, and 90% depths across 100K-, 512K-, and 1M-token documents.
- The changes live in a custom branch that comes with build instructions; no prebuilt binaries are provided.
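
To put the prefill figures in wall-clock terms, here is a rough back-of-envelope calculation. It rests on two assumptions that the post does not state: that "256K" means 256 × 1024 tokens, and that the quoted tokens/s is an average over the whole prompt rather than a peak rate.

```python
# Back-of-envelope prefill times from the throughput figures quoted above.
# Assumptions (not from the post): "256K" = 256 * 1024 tokens, and the quoted
# tokens/s is an average that holds over the entire prompt.

def prefill_minutes(n_tokens: int, tokens_per_s: float) -> float:
    """Wall-clock minutes needed to prefill n_tokens at a constant rate."""
    return n_tokens / tokens_per_s / 60.0

N_TOKENS = 256 * 1024  # assumed meaning of "256K"

before = prefill_minutes(N_TOKENS, 56.0)   # previous throughput
after = prefill_minutes(N_TOKENS, 263.0)   # throughput with the patch

print(f"before: {before:.1f} min, after: {after:.1f} min")
```

Under those assumptions, a 256K-token prompt takes about 78 minutes to prefill before the patch and about 17 minutes after it.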
This work lets users run long-context DeepSeek V4 Flash locally without needing absurd amounts of VRAM.
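
For readers who want to reproduce a needle-in-haystack check like the one described, the prompt construction might look like the sketch below. This is not the author's harness: the filler vocabulary, the passphrase, and the helper names (`build_haystack`, `make_cases`, `passed`) are all hypothetical, and word counts stand in for token counts, which will differ under the model's tokenizer. Sending each prompt to the model is left out because it depends on how the custom branch is built and run.

```python
import random

def build_haystack(n_words: int, needle: str, depth: float, seed: int = 0) -> str:
    """Return about n_words words of filler with `needle` inserted at the
    given fractional depth (0.0 = start of document, 1.0 = end)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    rng = random.Random(seed)
    # Hypothetical filler vocabulary; any text unrelated to the needle works.
    vocab = ["river", "stone", "cloud", "lantern", "meadow", "harbor", "copper", "window"]
    words = [rng.choice(vocab) for _ in range(n_words)]
    words.insert(int(n_words * depth), needle)
    return " ".join(words)

def make_cases(sizes, depths, needle):
    """Yield (size, depth, prompt) for every size/depth combination."""
    question = "\n\nWhat is the secret passphrase mentioned in the text above?"
    for size in sizes:
        for depth in depths:
            yield size, depth, build_haystack(size, needle, depth) + question

def passed(model_answer: str, secret: str) -> bool:
    """A case passes if the model's answer contains the secret string."""
    return secret.lower() in model_answer.lower()

SECRET = "violet-anchor-42"  # hypothetical passphrase
NEEDLE = f"The secret passphrase is {SECRET}."

# Small sizes keep the demo fast; the post's tests used 100K, 512K, and 1M tokens.
cases = list(make_cases(sizes=[1_000, 5_000], depths=[0.10, 0.50, 0.90], needle=NEEDLE))
print(f"built {len(cases)} test prompts")
```

Each prompt would then be passed to the model, with `passed(answer, SECRET)` recording whether the needle was retrieved at that size and depth.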