A user testing self-hosted Qwen 3.6 27B and Gemma 4 models on four RTX 5070 Ti cards reports that Multi-Token Prediction (MTP) degrades output quality compared with the non-MTP variants. In code review tasks, the non-MTP model produced more detailed findings, including fix suggestions, while consuming fewer tokens than its MTP counterpart. The non-MTP setup reached roughly 2000 tokens per second for prompt processing (pp/s) and 50-60 tokens per second for token generation (tg/s). The MTP configuration generated faster, at 100-120 tg/s, but processed prompts more slowly, at around 1300 pp/s. Although generation throughput roughly doubled, real-world agent tasks finished only about 20% faster with MTP because the MTP model consumed more context. The user ran llama.cpp with specific GGUF files from Unsloth and reported a similarly negative experience when testing Gemma 4.
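To see how a doubling of generation speed can shrink to a much smaller end-to-end gain, here is a rough back-of-the-envelope sketch. It models task time as prompt-processing time plus generation time, using the throughput figures quoted above (midpoints of the reported ranges). The workload sizes and the 40% token overhead for the MTP run are purely hypothetical values chosen for illustration; the report does not give them, and the model ignores caching, batching, and other real-world effects.

```python
def task_seconds(prompt_tokens, generated_tokens, pp_rate, tg_rate):
    """Estimate wall-clock seconds as prompt time plus generation time."""
    return prompt_tokens / pp_rate + generated_tokens / tg_rate


# Throughput figures from the report (midpoints of the quoted ranges).
NON_MTP = {"pp_rate": 2000.0, "tg_rate": 55.0}
MTP = {"pp_rate": 1300.0, "tg_rate": 110.0}

# Hypothetical workload: these numbers are NOT from the report.
PROMPT_TOKENS = 40_000
GENERATED_TOKENS = 20_000
MTP_TOKEN_OVERHEAD = 1.4  # assumption: the MTP run consumes 40% more tokens

baseline = task_seconds(PROMPT_TOKENS, GENERATED_TOKENS, **NON_MTP)
mtp_same_tokens = task_seconds(PROMPT_TOKENS, GENERATED_TOKENS, **MTP)
mtp_more_tokens = task_seconds(
    PROMPT_TOKENS * MTP_TOKEN_OVERHEAD,
    GENERATED_TOKENS * MTP_TOKEN_OVERHEAD,
    **MTP,
)

for label, seconds in [
    ("non-MTP", baseline),
    ("MTP, same token count", mtp_same_tokens),
    ("MTP, 40% more tokens", mtp_more_tokens),
]:
    saved = 1 - seconds / baseline  # fraction of baseline time saved
    print(f"{label}: {seconds:.0f} s ({saved:.0%} less time than non-MTP)")
```

Under these assumed inputs, the time saved drops from roughly 45% when both runs use the same number of tokens to roughly 22% once the MTP run consumes more tokens, which is in the same ballpark as the reported figure of about 20%. Different workload assumptions would give different results.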
User Reports Inferior Quality and Efficiency with MTP Models in Qwen 3.6 and Gemma 4