Solo and MoA Benchmarking on Multiple Tasks

The article presents benchmark results comparing individual (solo) models against Mixture-of-Agents (MoA) configurations across five tasks (Bug, Tool, Arch, Clinical, and DLQ), plus an overall average. The evaluation harness was Hermes Agent v0.18. Scores were generated by two judge models, ChatGPT 5.5 and Claude Opus 4.8, using a rubric that weights Correctness, Completeness, Depth, Actionability, Clarity, and Trust.
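To make the scoring step concrete, the following is a minimal sketch of how a weighted rubric could be turned into per-task and overall scores. The weights, the 0-100 scale, and the function names are assumptions for illustration only: the article names the six criteria and the two judges but does not state the weights or how the judges' scores were combined.

```python
from statistics import mean

# Hypothetical weights: the article lists the criteria but not their weights.
RUBRIC_WEIGHTS = {
    "Correctness": 0.30,
    "Completeness": 0.20,
    "Depth": 0.15,
    "Actionability": 0.15,
    "Clarity": 0.10,
    "Trust": 0.10,
}


def weighted_score(criterion_scores, weights=RUBRIC_WEIGHTS):
    """Combine per-criterion scores (assumed 0-100) into one weighted score."""
    missing = set(weights) - set(criterion_scores)
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    total_weight = sum(weights.values())
    # Normalize by the total weight so the weights need not sum to exactly 1.
    return sum(weights[c] * criterion_scores[c] for c in weights) / total_weight


def task_score(judge_scores):
    """Average the weighted scores from several judges for one task.

    Simple averaging across judges is an assumption, not the article's method.
    """
    return mean(weighted_score(scores) for scores in judge_scores)


def overall_average(task_scores):
    """Unweighted mean over the per-task scores of one configuration."""
    return mean(task_scores.values())
```

Under these assumptions, each configuration would get one `task_score` per task (Bug, Tool, Arch, Clinical, DLQ), and `overall_average` would yield the figure used for ranking.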

The top-ranked configuration was an MoA using Gemma-4-12B-4bit(vLLM), Ornith1.0-35B-Q4_K_M(llama.cpp), and Qwen-3.6-27B-4bit(vLLM) as drafters with Qwen-3.6-27B-4bit(vLLM) as the aggregator, achieving an average score of 86.7.
The second-ranked MoA configuration used DeepSeek-v4-Pro (cloud) as the aggregator and scored 85.9 overall.
The highest-performing solo model was Qwen3.6-35B-A3B-Q4_K_M(llama.cpp), ranked 3rd with an average of 85.2; the next-best solo model, Qwen-3.6-27B-4bit(vLLM), ranked 6th with 84.6.
Nemotron 2 Cascade Q4_K_M(llama.cpp) performed poorly as a solo model (rank 14, score 5.8) and also yielded low results when used as an aggregator in MoA setups.
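The drafter/aggregator structure described above can be sketched as follows. This is a generic illustration of the MoA pattern, not the Hermes Agent implementation: the `Model` type, the `mixture_of_agents` function, and the prompt wording are all hypothetical, and each model is represented as a plain callable so the sketch stays independent of any particular serving backend.

```python
from typing import Callable, Sequence

# Hypothetical interface: a model takes a prompt and returns its completion.
Model = Callable[[str], str]


def mixture_of_agents(task: str, drafters: Sequence[Model], aggregator: Model) -> str:
    """Collect one draft per drafter, then ask the aggregator to merge them."""
    if not drafters:
        raise ValueError("at least one drafter is required")

    # Step 1: every drafter answers the task independently.
    drafts = [draft(task) for draft in drafters]

    # Step 2: the aggregator sees the task plus all numbered drafts.
    numbered = "\n\n".join(
        f"Draft {i}:\n{text}" for i, text in enumerate(drafts, start=1)
    )
    prompt = (
        f"Task:\n{task}\n\n"
        f"Candidate answers:\n{numbered}\n\n"
        "Synthesize a single best answer from the candidates."
    )
    return aggregator(prompt)
```

Because the aggregator writes the final answer, its quality bounds the output, which is consistent with the low scores reported when a weak model served as the aggregator. Note that the same model can appear both as a drafter and as the aggregator, as in the top-ranked configuration.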

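The results indicate that specific MoA configurations can outperform individual large models, particularly in tasks requiring high correctness and completeness: the best MoA configuration averaged 86.7, against 85.2 for the best solo model. The gain depends on the choice of aggregator, since setups aggregated by a weak model scored low.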