A user reports that Qwen 27B, quantized to q6kxl and running with multi-token prediction on a system with 4090 and 3090 GPUs, achieves decode speeds of 50-90 tokens/s and pre-fill speeds of 1500-2200 tokens/s. The model reliably interfaces with various APIs and generates functional code for single-page apps, LaTeX documents, parsers, and crawlers.

  • Model: Qwen 27B (q6kxl quantization)
  • Hardware: 4090+3090 system with 96GB VRAM
  • Decode speed: 50-90 tokens/s
  • Pre-fill speed: 1500-2200 tokens/s
  • Capability: Ingests moderately sized codebases and preserves the existing schema when making updates.
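
To put the reported throughput in perspective, the sketch below estimates end-to-end response time from the pre-fill and decode ranges above. It is a rough back-of-the-envelope model, not a benchmark: it assumes pre-fill and decode run back to back at a constant rate and ignores model load time, sampling overhead, and any caching. The request size (20,000 prompt tokens, 1,000 output tokens) and the function name `estimate_latency` are hypothetical and chosen only for illustration.

```python
def estimate_latency(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Return estimated seconds to read the prompt and then generate the output.

    Assumes constant throughput in each phase and no other overhead.
    """
    prefill_seconds = prompt_tokens / prefill_tps
    decode_seconds = output_tokens / decode_tps
    return prefill_seconds + decode_seconds


# Hypothetical request size (not from the report).
PROMPT_TOKENS = 20_000
OUTPUT_TOKENS = 1_000

# Low and high ends of the reported ranges.
slowest = estimate_latency(PROMPT_TOKENS, OUTPUT_TOKENS, prefill_tps=1500, decode_tps=50)
fastest = estimate_latency(PROMPT_TOKENS, OUTPUT_TOKENS, prefill_tps=2200, decode_tps=90)

print(f"Estimated response time: {fastest:.1f} s to {slowest:.1f} s")
```

Under these assumptions, decoding takes a large share of the total even though the output is only a twentieth the size of the prompt, so the decode rate matters more than the pre-fill rate for long generations.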

The user describes this configuration as the first local setup to offer reliable coherence and speed on this hardware without requiring extensive tuning of tools or harnesses.