A user demonstrates the NVFP4-quantized Qwen3.6-35B-A3B model running on an RTX Pro 6000 Blackwell GPU, reaching approximately 2000 tokens per second of aggregate throughput across 30 concurrent image-captioning streams. The configuration uses vLLM with the FLASHINFER attention backend and prefix caching to sustain that concurrency. Because the model is a Mixture of Experts (MoE), only about 53-61% of its experts are activated even at high concurrency, which lets it outperform dense models despite its larger total parameter count. The result indicates that NVFP4 quantization on Blackwell hardware can handle highly parallel multimodal workloads efficiently without exhausting VRAM.
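To make the quoted figures concrete, here is a minimal sketch that works out the average per-stream rate implied by the aggregate number and assembles one plausible vLLM launch line for the described setup. It rests on several assumptions: the model path is a hypothetical placeholder (the summary does not give one), the `VLLM_ATTENTION_BACKEND` environment variable is only one way of selecting FLASHINFER and its handling may differ between vLLM releases, and capping `--max-num-seqs` at the stream count is an illustrative choice rather than the user's reported setting. The script only prints the command; it does not start a server.

```python
import shlex

# Figures quoted in the summary above (both approximate).
AGGREGATE_TOKENS_PER_S = 2000   # aggregate throughput across all streams
CONCURRENT_STREAMS = 30         # concurrent image-captioning requests

# Hypothetical placeholder: the summary does not name a checkpoint path.
MODEL_ID = "path/to/Qwen3.6-35B-A3B-NVFP4"


def per_stream_rate(aggregate: float, streams: int) -> float:
    """Average tokens/s per stream if throughput is shared evenly."""
    if streams <= 0:
        raise ValueError("streams must be positive")
    return aggregate / streams


def build_launch(model: str, max_seqs: int) -> tuple[dict[str, str], list[str]]:
    """Return (environment overrides, argv) for a vLLM server launch.

    Assumption: the attention backend is chosen through the
    VLLM_ATTENTION_BACKEND environment variable; check the vLLM
    version in use, since backend selection has changed over time.
    """
    env = {"VLLM_ATTENTION_BACKEND": "FLASHINFER"}
    argv = [
        "vllm", "serve", model,
        "--enable-prefix-caching",          # reuse KV cache for shared prompt prefixes
        "--max-num-seqs", str(max_seqs),    # illustrative cap on concurrent sequences
    ]
    return env, argv


if __name__ == "__main__":
    env, argv = build_launch(MODEL_ID, CONCURRENT_STREAMS)
    env_prefix = " ".join(f"{key}={value}" for key, value in env.items())
    print(env_prefix, shlex.join(argv))
    rate = per_stream_rate(AGGREGATE_TOKENS_PER_S, CONCURRENT_STREAMS)
    print(f"{rate:.1f} tokens/s per stream on average")
```

Dividing the aggregate figure evenly gives roughly 67 tokens per second per stream. This is an average only: individual captioning requests will run faster or slower depending on image size, prompt length, and how much of the prompt prefix is already cached.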
NVFP4 Qwen3.6-35B-A3B on Blackwell achieves ~2000 tps with 30 concurrent streams