A new study evaluates whether large language models (LLMs) can induce specific belief states in other agents through actions rather than conversation, a capability termed Non-Conversational Planning ToM (NCP-ToM), where ToM stands for theory of mind. Using the NCP-ExploreToM framework, researchers tested six frontier models and human participants on 600 task instances in which agents had to move objects or direct characters to achieve belief goals.
- GPT-5 succeeded on approximately 80% of tasks in the agentic setting.
- GPT-5 was the only model to outperform human participants, though it remained less robust across contexts.
- All models and humans were better at inducing true beliefs than false ones.
The findings point to emerging social-reasoning capabilities in LLMs on non-conversational tasks and underscore the need for agentic evaluations when assessing the safety and alignment of autonomous social agents.
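
To make the idea of a belief goal concrete, here is a minimal toy sketch in Python. It is not the NCP-ExploreToM framework: the world (two rooms, two opaque containers, one observer), the action names, and the `plan` helper are all hypothetical, chosen only to illustrate inducing a true versus a false belief through actions alone. It assumes the observer updates her belief only when she is in the kitchen while the object is moved.

```python
from collections import deque
from typing import Callable, List, NamedTuple, Optional, Tuple

# Hypothetical toy world, not taken from the study.
ROOMS = ("kitchen", "hall")
CONTAINERS = ("box", "drawer")  # assumption: both containers are in the kitchen


class State(NamedTuple):
    obj_at: str           # container that really holds the object
    observer_room: str    # room the observer is currently in
    observer_belief: str  # container the observer thinks holds the object


Action = Tuple[str, str]
ACTIONS: List[Action] = (
    [("move_object", c) for c in CONTAINERS]
    + [("send_observer", r) for r in ROOMS]
)


def step(state: State, action: Action) -> Optional[State]:
    """Apply one action; return None if the action changes nothing."""
    kind, target = action
    if kind == "move_object":
        if target == state.obj_at:
            return None
        # The observer updates her belief only if she witnesses the move.
        witnessed = state.observer_room == "kitchen"
        belief = target if witnessed else state.observer_belief
        return State(target, state.observer_room, belief)
    if target == state.observer_room:
        return None
    return state._replace(observer_room=target)


def plan(start: State, goal: Callable[[State], bool]) -> Optional[List[Action]]:
    """Breadth-first search for a shortest action sequence reaching the goal."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        state, path = queue.popleft()
        if goal(state):
            return path
        for action in ACTIONS:
            nxt = step(state, action)
            if nxt is not None and nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, path + [action]))
    return None


start = State(obj_at="box", observer_room="kitchen", observer_belief="box")

# True-belief goal: the object is in the drawer and the observer knows it.
true_plan = plan(start, lambda s: s.obj_at == "drawer" and s.observer_belief == "drawer")

# False-belief goal: the object is in the drawer but the observer still thinks "box".
false_plan = plan(start, lambda s: s.obj_at == "drawer" and s.observer_belief == "box")

print(true_plan)   # [('move_object', 'drawer')]
print(false_plan)  # [('send_observer', 'hall'), ('move_object', 'drawer')]
```

In this toy world the false-belief goal needs an extra step, because the planner must first control what the observer can see. That is a property of the sketch only and should not be read as an explanation of the study's results.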