Pairing Qwen3.6-27B with a three-critic harness (code review, test review, and Playwright e2e checks) makes the model usable for coding work by catching the errors that smaller models tend to make.

  • The harness includes distinct critics for code review, test review, and Playwright end-to-end testing, each provided with specific context.
  • Each critic starts from a fresh context, because a reviewer that has not seen the code being written catches issues that self-review misses.
  • A good critic pipeline narrows the reliability gap between a 27B model and frontier models by catching mistakes the model would otherwise let through.
  • The author argues that reliability comes from the process and scaffolding rather than model size or prompt-tuning alone.

The article concludes that teams running models in production should focus on verifying results through robust harnesses rather than blaming the model for flakiness.