DeepSeek-V4-Flash MXFP4 quantization runs slowly on CPU

A user reports that Bartowski's MXFP4 quantization of DeepSeek-V4-Flash performs poorly for CPU inference: despite 512 GB of DDR4 memory, the setup reached only 3.2 tokens per second (t/s).

The user tested the configuration on a Xeon E5-2699 v4 processor, with a GTX 1060 used for offloading.
For comparison, GLM 5.2 (40B active parameters, Q4_K_XL quantization) ran at 1.8 t/s.
The user suspects the MXFP4 format is the bottleneck and estimates the effective memory bandwidth at around 20 GB/s.
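
A bandwidth estimate of this kind can be sketched as follows. This is a rough back-of-the-envelope model, not the user's actual calculation: it assumes token generation is memory-bound, so every active weight is read once per token and effective bandwidth is roughly bytes per token times tokens per second. The function names are hypothetical, the 4.5 bits per weight used for Q4_K_XL is an assumed approximation, and the post does not state the active parameter count for DeepSeek-V4-Flash, so that model is left as an input rather than filled in. The 4.25 bits per weight for MXFP4 follows from its block layout (32 four-bit elements sharing one 8-bit scale).

```python
# Rough memory-bound model: each generated token reads every active weight once.
# All names here are illustrative, not part of any library.

# MXFP4 stores 32 four-bit elements per block plus one shared 8-bit scale.
MXFP4_BITS_PER_WEIGHT = 4 + 8 / 32  # 4.25

# ASSUMPTION: Q4_K-style quantizations average roughly 4.5 bits per weight.
Q4_K_BITS_PER_WEIGHT_APPROX = 4.5


def bytes_per_token(active_params: float, bits_per_weight: float) -> float:
    """Bytes of weights read to generate one token."""
    return active_params * bits_per_weight / 8.0


def effective_bandwidth_gbs(active_params: float,
                            bits_per_weight: float,
                            tokens_per_s: float) -> float:
    """Implied memory bandwidth in GB/s for an observed generation speed."""
    return bytes_per_token(active_params, bits_per_weight) * tokens_per_s / 1e9


def peak_ddr_bandwidth_gbs(channels: int,
                           mega_transfers_per_s: float,
                           bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s for one socket."""
    return channels * mega_transfers_per_s * 1e6 * bus_bytes / 1e9


if __name__ == "__main__":
    # GLM figures from the post: 40B active parameters at 1.8 t/s.
    glm = effective_bandwidth_gbs(40e9, Q4_K_BITS_PER_WEIGHT_APPROX, 1.8)
    # ASSUMPTION: one socket with 4 channels of DDR4-2400.
    peak = peak_ddr_bandwidth_gbs(channels=4, mega_transfers_per_s=2400)
    print(f"GLM implied bandwidth: {glm:.1f} GB/s")
    print(f"Single-socket DDR4-2400 peak: {peak:.1f} GB/s")
```

Plugging a model's active parameter count, `MXFP4_BITS_PER_WEIGHT`, and the observed 3.2 t/s into `effective_bandwidth_gbs` gives the matching figure for the MXFP4 run. If one format implies a much lower bandwidth than another on the same hardware, that points toward compute overhead in dequantization rather than memory speed, which is consistent with the user's suspicion but does not prove it.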

The post raises the possibility that some quantization formats are inefficient for CPU inference, and asks for alternative Q4 quantizations.