A user reports that running the Bartowski quantized DeepSeek-V4-Flash model in MXFP4 format on a CPU-only system yields disappointing performance. Despite having 512GB of DDR4 memory, the setup only achieved 3.2 tokens per second.
- The user tested the configuration on an E5-2699v4 processor with a GTX 1060 used for offloading.
- Performance was compared against GLM 5.2 (40B active parameters in Q4_K_XL), which ran at 1.8 t/s.
- The user suspects the MXFP4 format is causing the bottleneck, estimating effective memory bandwidth around 20GB/s.
The post highlights potential efficiency issues with specific quantization formats for CPU inference and seeks alternative Q4 quantizations.