A user compared Qwen3.6 27B, Gemma4 26B A4B QAT, and Ornith1.0 35B MoE using the inspect-ai framework on an RTX 3090 to evaluate local model performance. The results were mixed across general knowledge, grounding, and coding benchmarks: Qwen3.6 generally led on scores, while Ornith showed strengths in specific areas such as DROP.
- In General Knowledge and Reasoning, Qwen3.6 achieved the best or joint-best score in 4 of 6 benchmarks, including GSM8K (0.96) and IFEval (0.95), while Ornith led in MMLU 0-shot (0.91).
- For Grounding and Recall, Ornith scored highest on DROP (0.952) compared to Qwen3.6 (0.947) and Gemma4 (0.932), with all models scoring 10.0 on NIAH.
- In code generation, Qwen3.6 outperformed Ornith on DS-1000 (0.66 vs 0.48) and SciCode (10.769 vs 1.538), while all three models scored 0.97 on ClassEval.
- The author noted significant practical challenges, including infinite looping in Gemma4 and extreme processing times, such as IFEvalCode taking 18 hours for Qwen3.6.
The article highlights the difficulty of running comprehensive local benchmarks due to configuration issues and resource constraints, suggesting a need for more convenient testing methods.
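
To make the per-benchmark comparison easier to check, the scores quoted above can be tabulated and the leader (or tied leaders) computed mechanically. The following is a minimal sketch, not the author's tooling: the `SCORES` dictionary and the `leaders` and `summarize` helpers are hypothetical names introduced here, and only benchmarks for which the summary quotes at least two models are included. GSM8K, IFEval, and MMLU are omitted because only the leading score is given.

```python
# Scores copied from the summary above. Models without a quoted score for a
# benchmark are left out rather than guessed.
SCORES = {
    "DROP": {"Qwen3.6": 0.947, "Gemma4": 0.932, "Ornith": 0.952},
    "NIAH": {"Qwen3.6": 10.0, "Gemma4": 10.0, "Ornith": 10.0},
    "DS-1000": {"Qwen3.6": 0.66, "Ornith": 0.48},
    "SciCode": {"Qwen3.6": 10.769, "Ornith": 1.538},
    "ClassEval": {"Qwen3.6": 0.97, "Gemma4": 0.97, "Ornith": 0.97},
}


def leaders(results):
    """Return the model(s) with the highest score, sorted by name."""
    best = max(results.values())
    return sorted(model for model, score in results.items() if score == best)


def summarize(scores):
    """Map each benchmark name to its list of leading models."""
    return {bench: leaders(results) for bench, results in scores.items()}


if __name__ == "__main__":
    for bench, top in summarize(SCORES).items():
        label = "tie: " if len(top) > 1 else ""
        print(f"{bench:10s} {label}{', '.join(top)}")
```

Returning ties as a list keeps benchmarks like NIAH and ClassEval, where every model posted the same score, from being credited to a single model.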