Running Qwen3.6-27B inside a three-critic harness (code review, test review, and Playwright e2e checks) makes the model usable for coding work by catching the errors that smaller models tend to make.
- The harness includes distinct critics for code review, test review, and Playwright end-to-end testing, each provided with specific context.
- Each critic must start from a fresh context: a reviewer seeing the code for the first time catches issues that self-review misses.
- A good critic pipeline narrows the reliability gap between a 27B model and frontier models by catching mistakes that would otherwise slip through.
- The author argues that reliability comes from the process and scaffolding rather than model size or prompt-tuning alone.
The article concludes that teams running models in production should focus on verifying results through robust harnesses rather than blaming the model for flakiness.