Qwen3.6-27B becomes usable for coding with a 3-critic harness

Wrapping Qwen3.6-27B in a three-critic harness (code review, test review, and Playwright e2e checks) makes the model usable for coding work by catching the errors that smaller models tend to make.

The harness runs code review, test review, and Playwright end-to-end testing as distinct critics, each given its own specific context.
Each critic must start from a fresh context: a reviewer that has not already seen the code catches issues that self-review misses.
A good critic pipeline reduces the reliability gap between a 27B model and frontier models by catching extra mistakes.
The author argues that reliability comes from the process and scaffolding rather than model size or prompt-tuning alone.

The article concludes that teams running models in production should focus on verifying results through robust harnesses rather than blaming the model for flakiness.