A user demonstrates running StepFun's 198B-parameter Step-3.7-Flash model on a consumer 4×RTX 3090 setup, revealing critical performance trade-offs between quantization levels and multi-token prediction (MTP) with vision capabilities.
- IQ3_XXS (72GB) runs fully in VRAM at 65 tokens/s, outperforming the larger IQ4_XS (99GB) which spills to CPU at 33 tokens/s, achieving a 2.4x speedup.
- MTP speculative decoding provides a +25% text speed boost but causes hard aborts when processing images because the draft context cannot decode image tokens.
- Adding the MTP draft head forces a VRAM spill unless KV cache is downgraded to q4_0, which frees ~4.5GB to keep all components resident.
- The model requires specific sampling parameters (temp 1.0 / top_p 0.95) and a reasoning budget cap to prevent infinite loops in llama.cpp.
The findings indicate that for MoE models, ensuring full VRAM residency is more impactful than higher quantization precision, and MTP is currently incompatible with multimodal tasks due to engine-level limitations.