This article addresses unidentifiable latent states in LLM-based world models, a problem caused by history bypass, and proposes strict latent state mediation as the remedy. To realize it, the authors introduce textual latent states and factorized GRPO (fGRPO), a reinforcement learning method used during training.
- Strict mediation requires predictions to depend only on the latent state and action, making representation quality empirically testable.
- Textual latent states are discrete, interpretable, and variable-length, overcoming the non-differentiability of traditional text-based representations.
- Factorized GRPO (fGRPO) is a tree-structured reinforcement learning method designed to enforce strict mediation during training.
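
The points above can be made concrete with a minimal sketch. It is not the authors' implementation: the names `encode`, `predict`, and `rollout_with_strict_mediation`, the type aliases, and the assumption that the predictor returns both a next state and an observation are all hypothetical. The first function shows strict mediation as an interface constraint, where the raw history is consumed once by the encoder and never reaches the predictor. The second shows only the standard group-relative advantage used by GRPO (rewards normalized within one sampled group); the tree structure and factorization specific to fGRPO are not described in this summary and are not reproduced here.

```python
from statistics import mean, pstdev
from typing import Callable, List, Sequence, Tuple

# Hypothetical type aliases for illustration; not taken from the article.
History = Sequence[Tuple[str, str]]  # (action, observation) pairs
LatentState = str                    # a textual latent state


def rollout_with_strict_mediation(
    history: History,
    actions: Sequence[str],
    encode: Callable[[History], LatentState],
    predict: Callable[[LatentState, str], Tuple[LatentState, str]],
) -> List[str]:
    """Roll out predictions where the predictor never sees the raw history."""
    state = encode(history)  # the history is consumed only here
    observations: List[str] = []
    for action in actions:
        # predict receives only (state, action), so it cannot bypass the state
        state, obs = predict(state, action)
        observations.append(obs)
    return observations


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-8
) -> List[float]:
    """Standard GRPO-style advantages: normalize rewards within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy stand-ins (assumptions) so the sketch runs without a language model.
def toy_encode(history: History) -> LatentState:
    # Summarize the history as a short text state: the last observed room.
    return "location=" + (history[-1][1] if history else "start")


def toy_predict(state: LatentState, action: str) -> Tuple[LatentState, str]:
    # The next state and observation are computed from the action alone here;
    # a real predictor would also read the textual state it is given.
    room = action.split()[-1]
    return "location=" + room, f"You are in the {room}."
```

Because `predict` has no history parameter, any information the rollout needs must be written into the textual state, which is what makes the quality of that state testable.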
Experiments on TextWorld and ScienceWorld show gains of up to 57% in representation quality and 98% in rollout performance, with the benefits growing as task complexity increases.