A user evaluated eight local models on a custom medieval fantasy role-playing benchmark covering quest completion, scene endings, and character detection. An external LLM grader judged the outputs, and sample sizes varied by category.

  • Gemma-4-31B achieved the highest overall pass rate at 87%.
  • Qwen3.6-27B followed closely with an 82% pass rate.
  • Gemma-4-12B scored 80%, while smaller models ranged between 55% and 70%.
  • The evaluation revealed sharp performance drops in specific sub-categories, such as NPC thoughts, that the overall scores masked.

The author highlights that looking only at overall percentages hides uneven model capabilities across different role-playing tasks.
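
The masking effect can be illustrated with a minimal sketch. It assumes the overall score is a pooled pass rate (total passes divided by total samples), which the post does not confirm. The category names echo the summary, but every count below is hypothetical and is not taken from the benchmark, and `pass_rates` is an illustrative helper rather than the author's code.

```python
from collections import defaultdict

def pass_rates(results):
    """Return (overall, per_category) pass rates.

    results: iterable of (category, passed) pairs, where passed is a bool
    representing the grader's verdict for one sample.
    """
    totals = defaultdict(lambda: [0, 0])  # category -> [passes, samples]
    for category, passed in results:
        totals[category][0] += int(passed)
        totals[category][1] += 1

    per_category = {c: p / n for c, (p, n) in totals.items()}
    # Pooled rate: large categories dominate, small ones barely register.
    overall = (sum(p for p, _ in totals.values())
               / sum(n for _, n in totals.values()))
    return overall, per_category

# Hypothetical verdicts, NOT real benchmark data: two large categories
# that go well and one small category that goes badly.
results = (
    [("quest_completion", True)] * 45 + [("quest_completion", False)] * 5
    + [("scene_endings", True)] * 36 + [("scene_endings", False)] * 4
    + [("npc_thoughts", True)] * 3 + [("npc_thoughts", False)] * 7
)

overall, per_category = pass_rates(results)
print(f"overall: {overall:.0%}")
for category, rate in per_category.items():
    print(f"{category}: {rate:.0%}")
```

With these made-up counts the pooled rate is 84%, while the small `npc_thoughts` category passes only 30% of its samples. Reporting per-category rates next to the overall figure keeps that kind of drop visible.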