The article challenges the supposed dichotomy between large language models (LLMs) and world models, arguing that LLMs are a degenerate special case of world models rather than a separate, competing paradigm. It posits a continuous spectrum from next-token prediction to latent-space architectures, with current research already occupying intermediate positions.
- For an LLM, the state space is the set of all token sequences, and there is only one kind of action: appending a single token.
- World models are presented as a strict generalization of this framework rather than an alternative paradigm.
- A continuous spectrum exists between next-token prediction and JEPA (Joint-Embedding Predictive Architecture), populated by multi-token and future-summary prediction methods.
- Moving along this spectrum progressively relaxes the LLM constraints, but at the cost of surrendering the advantages of internet-scale self-supervised data and the transformer architecture.
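
The "degenerate special case" framing in the points above can be illustrated with a minimal sketch. It assumes a world model can be reduced to a transition function over states and actions; the names `WorldModel`, `append_token`, and `llm_as_world_model` are hypothetical and do not come from the article. The sketch captures only the state and action structure and omits any learned predictor.

```python
from dataclasses import dataclass
from typing import Callable, Generic, Iterable, Tuple, TypeVar

S = TypeVar("S")  # state type
A = TypeVar("A")  # action type


@dataclass(frozen=True)
class WorldModel(Generic[S, A]):
    """Hypothetical minimal interface: a transition function (state, action) -> state."""

    step: Callable[[S, A], S]

    def rollout(self, state: S, actions: Iterable[A]) -> S:
        # Apply each action in order and return the final state.
        for action in actions:
            state = self.step(state, action)
        return state


# LLM special case: a state is a token sequence, an action is one token.
TokenState = Tuple[int, ...]


def append_token(state: TokenState, token: int) -> TokenState:
    # The only available transition: extend the sequence by one token.
    return state + (token,)


# Assumption for illustration: the LLM setting is this interface with
# S fixed to token sequences and A fixed to single tokens.
llm_as_world_model: WorldModel[TokenState, int] = WorldModel(step=append_token)

if __name__ == "__main__":
    print(llm_as_world_model.rollout((), [5, 7, 9]))  # (5, 7, 9)
```

Under this reading, relaxing the LLM constraints amounts to substituting a richer state type or action type into the same interface, for example a continuous latent vector in place of the token tuple.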
The authors identify two open research questions about this transition: whether the self-supervised approach that works for text data can scale to instrumented, action-labelled environments, and whether transformers generalize to continuous-state prediction.