Qwen3.6-27B scores 82% on fantasy RP benchmark, trailing Gemma-4-31B

A user evaluated eight local models on a custom medieval fantasy role-playing benchmark covering quest completion, scene endings, and character detection. An external LLM grader judged the results, and the number of samples varied by category.

Gemma-4-31B achieved the highest overall pass rate at 87%.
Qwen3.6-27B followed closely with an 82% pass rate.
Gemma-4-12B scored 80%, while smaller models ranged between 55% and 70%.
The evaluation revealed sharp performance drops in specific sub-categories such as NPC thoughts, and the overall scores masked these drops.

The author notes that overall percentages alone hide uneven model capabilities across different role-playing tasks.
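
To illustrate how a pooled score can hide a weak sub-category when sample sizes differ, here is a minimal sketch. The counts, the category names, and the 20-point `gap` threshold are all made up for illustration; they are not the benchmark's actual data or the author's method.

```python
from dataclasses import dataclass


@dataclass
class CategoryResult:
    name: str
    passed: int
    total: int

    @property
    def rate(self) -> float:
        return self.passed / self.total


def overall_rate(results):
    """Pooled pass rate: all passes divided by all samples."""
    return sum(r.passed for r in results) / sum(r.total for r in results)


def find_cliffs(results, gap=0.20):
    """Names of categories whose pass rate sits at least `gap` below the pooled rate."""
    overall = overall_rate(results)
    return [r.name for r in results if overall - r.rate >= gap]


# Hypothetical counts, NOT taken from the benchmark.
results = [
    CategoryResult("quest_completion", 45, 50),
    CategoryResult("scene_endings", 27, 30),
    CategoryResult("character_detection", 9, 10),
    CategoryResult("npc_thoughts", 4, 10),
]

print(f"overall: {overall_rate(results):.0%}")
for r in results:
    print(f"{r.name}: {r.rate:.0%} ({r.passed}/{r.total})")
print("cliffs:", find_cliffs(results))
```

With these invented numbers the pooled rate looks healthy because the large categories dominate it, while the small `npc_thoughts` category fails most of its samples. Reporting per-category rates next to the pooled figure makes that kind of gap visible.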