Textual Belief States for World Models: Identifiable Representation Learning Under Strict Mediation

This article addresses the problem of unidentifiable latent states in LLM-based world models, which arises from history bypass, and proposes strict latent state mediation as the remedy. The authors introduce textual latent states and factorized GRPO (fGRPO), a tree-structured reinforcement learning method that enforces strict mediation during training.

Strict mediation requires predictions to depend only on the latent state and action, making representation quality empirically testable.
Textual latent states are discrete, interpretable, and variable-length, overcoming the non-differentiability of traditional text-based representations.
Factorized GRPO (fGRPO) is a tree-structured reinforcement learning method designed to enforce strict mediation during training.
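
To make the mediation constraint concrete, the following is a minimal sketch, not the authors' implementation. It assumes a world model split into two callables: one that updates a textual belief state and one that predicts the next observation from the belief state and action alone. All names (`MediatedWorldModel`, `update`, `predict`, `group_relative_advantages`) and the toy inventory rules are hypothetical. The advantage helper shows the standard group-relative normalization used in plain GRPO; the tree-structured factorization specific to fGRPO is not described in this summary and is not reproduced here.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Callable, List, Sequence, Tuple

# All names below are hypothetical and chosen for illustration only.
Observation = str
Action = str
BeliefState = str  # a textual latent state: discrete, readable, variable-length


@dataclass
class MediatedWorldModel:
    """World model whose predictor never receives the raw history."""
    # update: (previous belief, action, new observation) -> new belief
    update: Callable[[BeliefState, Action, Observation], BeliefState]
    # predict: (belief, action) -> predicted next observation
    predict: Callable[[BeliefState, Action], Observation]

    def rollout(
        self,
        initial: BeliefState,
        steps: Sequence[Tuple[Action, Observation]],
    ) -> List[Observation]:
        belief = initial
        predictions = []
        for action, observation in steps:
            # Strict mediation: only (belief, action) reach the predictor,
            # so a poor belief state shows up directly as a poor prediction.
            predictions.append(self.predict(belief, action))
            belief = self.update(belief, action, observation)
        return predictions


def toy_update(belief: BeliefState, action: Action, obs: Observation) -> BeliefState:
    # Toy rule (an assumption): taking an item overwrites what is held.
    if action.startswith("take "):
        return "holding: " + action[len("take "):]
    return belief


def toy_predict(belief: BeliefState, action: Action) -> Observation:
    # Toy rule (an assumption): the inventory command reads the belief text.
    if action == "inventory":
        return "You are carrying: " + belief[len("holding: "):]
    return "OK"


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Standard GRPO-style normalization within one group of sampled outputs.

    Uses the population standard deviation; the choice of population versus
    sample standard deviation is an assumption of this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


model = MediatedWorldModel(update=toy_update, predict=toy_predict)
predictions = model.rollout(
    "holding: nothing",
    [("take key", "OK"), ("inventory", "You are carrying: key")],
)
# predictions == ["OK", "You are carrying: key"]
```

Because `predict` has no access to earlier observations, the second prediction is correct only if the belief text recorded that the key was taken. This is the sense in which strict mediation turns representation quality into something that can be measured through prediction accuracy.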

Experiments on TextWorld and ScienceWorld demonstrate gains of up to 57% in representation quality and 98% in rollout performance, with the benefits growing as task complexity increases.