Triadic Werewolf: A Jester Role for Multi-Hop Theory of Mind in LLMs

Researchers extended the Werewolf game with a Jester role to create a triadic social-deduction environment in which players must reason across three opposing utility functions, a challenge to large language models' theory-of-mind capabilities. In evaluations of GPT-4.1, DeepSeek-V3.1, and Llama-3.3-70B, the Jester won 60-70% of games, and GPT-4.1 wolves voted the Jester out on day 1 in 60-70% of games, a self-defeating action driven by language priors.

The Jester faction wins 60-70% of games while Werewolves never exceed a 20% win rate.
GPT-4.1 wolves voted the Jester out on day 1 in 60-70% of games, demonstrating strictly self-defeating behavior.
Self-learning improved performance for DeepSeek and Llama but harmed GPT-4.1, with the cost falling on Villagers rather than Werewolves.
Only DeepSeek learned the subtle strategy of appearing suspicious without looking intentionally suspicious.

This triadic incentive structure exposes a layer of multi-agent reasoning that dyadic deduction games leave invisible, highlighting limitations in how current models simulate opponent incentives.
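
To make the three competing objectives concrete, here is a minimal Python sketch of a win check after a day vote. The rules it encodes are assumptions based on the common Jester convention, not details taken from the study: the Jester wins the moment it is voted out, Villagers win when no Werewolves remain, and Werewolves win once they are at least as numerous as everyone else. The names `Role`, `Player`, `make_game`, and `winner_after_day_vote` are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Role(Enum):
    VILLAGER = "villager"
    WEREWOLF = "werewolf"
    JESTER = "jester"


@dataclass
class Player:
    name: str
    role: Role
    alive: bool = True


def winner_after_day_vote(players: list[Player], voted_out: str) -> Optional[Role]:
    """Eliminate `voted_out` by day vote and return the winning faction, if any.

    Assumed rules (illustrative, not confirmed by the source):
      - Jester wins immediately if eliminated by a day vote.
      - Villagers win when no Werewolves are alive.
      - Werewolves win when they are at least as numerous as all other players.
    """
    target = next(p for p in players if p.name == voted_out and p.alive)
    target.alive = False

    # Third utility function: being voted out is the Jester's goal.
    if target.role is Role.JESTER:
        return Role.JESTER

    wolves = sum(p.alive and p.role is Role.WEREWOLF for p in players)
    others = sum(p.alive and p.role is not Role.WEREWOLF for p in players)
    if wolves == 0:
        return Role.VILLAGER
    if wolves >= others:
        return Role.WEREWOLF
    return None  # no faction has won yet; the game continues


def make_game() -> list[Player]:
    # Hypothetical six-player setup, chosen only for the example.
    return [
        Player("A", Role.WEREWOLF),
        Player("B", Role.WEREWOLF),
        Player("C", Role.JESTER),
        Player("D", Role.VILLAGER),
        Player("E", Role.VILLAGER),
        Player("F", Role.VILLAGER),
    ]


# Voting out the Jester ends the game in the Jester's favor.
print(winner_after_day_vote(make_game(), "C"))
# Voting out a Villager leaves the game open for the Werewolves.
print(winner_after_day_vote(make_game(), "D"))
```

Under these assumed rules, a Werewolf that joins a day-1 vote against the Jester ends the game as a loss for its own faction, which is one way to read the self-defeating pattern reported for GPT-4.1 wolves.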