A user reports increased performance for DeepSeek V4 Pro running locally via a custom llama.cpp branch containing various fixes and optimizations. The article shares benchmark results from an EPYC 9374F system with an RTX PRO 6000 Max-Q GPU, noting that the model's memory usage remains high in mainline builds.
- Benchmarks were run using a 794GB GGUF file on hardware with 12 x 96GB DDR5 RAM (1,152GB total) and 96GB VRAM.
- The custom branch resolves issues with excessive memory consumption caused by lightning indexer compute buffers and CUDA top-k temporary buffers.
- Mainline llama.cpp currently has broken quantized KV cache support and possible bugs in prompt cache reuse and batch preparation.
The author highlights that while their specific optimizations improve speed, users relying on mainline llama.cpp may encounter significant memory overhead and functional bugs.
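
To put the hardware figures in context, here is a minimal sketch of the memory budget arithmetic. It is not taken from the author's branch. It uses only the sizes quoted above (12 x 96GB RAM, 96GB VRAM, a 794GB GGUF file) and treats them as nominal GB, since the source does not distinguish GB from GiB. The function names and the `gpu_resident_gb` parameter are hypothetical; the source does not say how much of the model is placed in VRAM.

```python
# Sizes as reported in the benchmark description (nominal GB).
DIMM_COUNT = 12
DIMM_SIZE_GB = 96
VRAM_GB = 96
MODEL_FILE_GB = 794


def total_ram_gb(dimm_count: int, dimm_size_gb: int) -> int:
    """Total installed system RAM across all DIMMs."""
    return dimm_count * dimm_size_gb


def headroom_gb(ram_gb: int, vram_gb: int, model_gb: int,
                gpu_resident_gb: int = 0) -> tuple[int, int]:
    """Return (ram_headroom, vram_headroom) after placing the model.

    gpu_resident_gb is a hypothetical knob: the portion of the model
    assumed to live in VRAM. The rest is assumed to live in system RAM.
    """
    if not 0 <= gpu_resident_gb <= min(vram_gb, model_gb):
        raise ValueError("gpu_resident_gb must fit in both VRAM and the model")
    ram_left = ram_gb - (model_gb - gpu_resident_gb)
    vram_left = vram_gb - gpu_resident_gb
    return ram_left, vram_left


if __name__ == "__main__":
    ram = total_ram_gb(DIMM_COUNT, DIMM_SIZE_GB)
    ram_left, vram_left = headroom_gb(ram, VRAM_GB, MODEL_FILE_GB)
    print(f"system RAM: {ram} GB")
    print(f"headroom with model fully in RAM: {ram_left} GB RAM, {vram_left} GB VRAM")
```

Whatever headroom remains has to hold the KV cache and compute buffers, so oversized buffers of the kind described above eat directly into it.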