A developer benchmarked Qwen 3.6 27B using vLLM on an RTX 6000 Pro Blackwell, comparing BF16, FP8, and NVFP4 quantizations to evaluate performance trade-offs for coding tasks.
- NVFP4 dominates token generation (decode) speed, reaching roughly 2.6x the throughput of BF16 because 4-bit weights cut the bytes that must be read from memory for each generated token.
- FP8 wins on prompt processing (prefill), offering about a 20% speedup over BF16 by running natively on Tensor Cores without dequantization overhead.
- NVFP4 pays a slight prefill penalty relative to FP8, reportedly because weights are dequantized on the fly during compute-heavy batches.
- The author found FP8 to be the best overall choice for coding purposes, noting that while NVFP4 is faster, it caused looping issues and less thorough responses in agent mode.
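
The memory-bandwidth argument behind the decode numbers can be checked with back-of-the-envelope arithmetic. The sketch below is not the author's method; it assumes every generated token reads all weights once, ignores KV-cache and activation traffic, assumes NVFP4 stores one 8-bit scale per 16 weights, and uses an approximate peak bandwidth figure that should be checked against the GPU's spec sheet.

```python
# Rough ceiling on single-stream decode speed: bandwidth / bytes of weights.
# All constants below are illustrative assumptions, not measured values.

PARAMS = 27e9            # parameter count implied by the "27B" model name
BANDWIDTH_GBPS = 1792.0  # ASSUMPTION: approximate peak memory bandwidth in GB/s

# Effective bits per weight. ASSUMPTION: NVFP4 = 4-bit values plus one
# 8-bit scale shared by each block of 16 weights (4.5 bits on average).
BITS_PER_WEIGHT = {
    "BF16": 16.0,
    "FP8": 8.0,
    "NVFP4": 4.0 + 8.0 / 16.0,
}


def weight_gb(fmt: str) -> float:
    """Approximate size of the model weights in GB for a given format."""
    return PARAMS * BITS_PER_WEIGHT[fmt] / 8 / 1e9


def decode_ceiling_tps(fmt: str) -> float:
    """Upper bound on tokens/s if each token reads all weights exactly once."""
    return BANDWIDTH_GBPS / weight_gb(fmt)


def ideal_speedup(fmt: str, baseline: str = "BF16") -> float:
    """Best-case decode speedup over the baseline from weight size alone."""
    return weight_gb(baseline) / weight_gb(fmt)


if __name__ == "__main__":
    for fmt in BITS_PER_WEIGHT:
        print(
            f"{fmt:6s} weights ~{weight_gb(fmt):5.1f} GB, "
            f"ceiling ~{decode_ceiling_tps(fmt):5.1f} tok/s, "
            f"ideal speedup {ideal_speedup(fmt):.2f}x"
        )
```

Under these assumptions the weight-size ratio alone would allow NVFP4 about 3.6x over BF16, so the reported 2.6x sits below that ceiling, which would be consistent with traffic and overheads the model ignores.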
The results suggest that NVFP4 offers the fastest decoding, while FP8 provides a better balance of speed and stability in practical use.
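
Because the comparison hinges on separating prefill from decode, it may help to see how the two rates can be derived from one timed request. This is a minimal sketch, not the benchmark's actual harness: `RunTiming` and its fields are hypothetical names, and the sample numbers are made up for illustration rather than taken from the results.

```python
from dataclasses import dataclass


@dataclass
class RunTiming:
    """Timing for one request (hypothetical structure for illustration)."""
    prompt_tokens: int
    output_tokens: int
    time_to_first_token_s: float  # approximates prefill duration
    total_time_s: float           # request start to last token


def prefill_tps(run: RunTiming) -> float:
    """Prompt tokens processed per second before the first output token."""
    return run.prompt_tokens / run.time_to_first_token_s


def decode_tps(run: RunTiming) -> float:
    """Output tokens per second after the first token has been emitted."""
    decode_time = run.total_time_s - run.time_to_first_token_s
    # The first output token arrives at the end of prefill, so only the
    # remaining tokens are attributed to the decode phase.
    return (run.output_tokens - 1) / decode_time


if __name__ == "__main__":
    # Made-up numbers, chosen only to exercise the functions.
    example = RunTiming(
        prompt_tokens=8000,
        output_tokens=501,
        time_to_first_token_s=2.0,
        total_time_s=12.0,
    )
    print(f"prefill: {prefill_tps(example):.0f} tok/s")
    print(f"decode:  {decode_tps(example):.0f} tok/s")
```

Reporting the two rates separately is what lets a format win one phase and lose the other, as FP8 and NVFP4 do here.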