GPT-5 outperforms humans in inducing belief states via planning

A new study evaluates the ability of large language models (LLMs) to induce specific belief states in other agents through actions rather than conversation, a capability termed Non-Conversational Planning Theory of Mind (NCP-ToM). Using the NCP-ExploreToM framework, researchers tested six frontier models and human participants on 600 task instances in which agents had to move objects or direct characters to achieve belief goals.

GPT-5 succeeded on approximately 80% of tasks in the agentic setting.
GPT-5 was the only model to outperform human participants, though it was less robust than humans across contexts.
All models and humans performed better on inducing true belief states than false ones.
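
To make the distinction between true and false belief goals concrete, here is a minimal Python sketch of a toy world in which a character's belief about an object changes only when that character witnesses the move. It is an illustration only: the class `ToyWorld`, the helpers `run_plan` and `belief_goal_met`, the action names, and the apple/basket scenario are all hypothetical and do not reflect the NCP-ExploreToM framework's actual task format or API.

```python
class ToyWorld:
    """Hypothetical toy world: characters update beliefs only for moves they witness."""

    def __init__(self, object_locations, characters):
        self.location = dict(object_locations)  # ground-truth object locations
        self.present = set(characters)          # characters currently in the room
        # Assumption: everyone starts with an accurate belief about every object.
        self.beliefs = {c: dict(object_locations) for c in characters}

    def leave(self, who):
        self.present.discard(who)

    def enter(self, who):
        # Assumption: re-entering does not reveal where objects now are.
        self.present.add(who)

    def move(self, obj, dest):
        self.location[obj] = dest
        for c in self.present:  # only witnesses update their belief
            self.beliefs[c][obj] = dest


def run_plan(world, plan):
    """Apply a list of (action, *args) tuples to the world, in order."""
    for action, *args in plan:
        getattr(world, action)(*args)
    return world


def belief_goal_met(world, who, obj, want_true):
    """Check whether `who` holds a true (or false) belief about where `obj` is."""
    holds_true = world.beliefs[who][obj] == world.location[obj]
    return holds_true == want_true


# Plan that induces a FALSE belief in Anne: move the apple while she is away.
false_plan = [("leave", "Anne"), ("move", "apple", "box"), ("enter", "Anne")]
# Plan that induces a TRUE belief in Anne: move the apple while she watches.
true_plan = [("move", "apple", "box")]

w_false = run_plan(ToyWorld({"apple": "basket"}, ["Anne", "Sally"]), false_plan)
w_true = run_plan(ToyWorld({"apple": "basket"}, ["Anne", "Sally"]), true_plan)
```

In this sketch the false-belief plan needs an extra step (getting Anne out of the room first), which hints at why such goals can demand more planning than true-belief goals. Whether that accounts for the performance gap reported above is not something this toy example can establish.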

The findings highlight emerging social-reasoning capabilities in LLMs for non-conversational task completion and underscore the necessity of agentic evaluations for understanding the safety and alignment of autonomous social agents.