The article introduces Einstein World Models (EWMs), a framework designed to enhance large language model reasoning by integrating visual-temporal rollouts into the reasoning trace. This approach allows models to utilize visual thought experiments as inspectable hypotheses to complement text-based processing.
- EWMs let an LLM call a world-module to generate short scenes of the scenario under consideration, treating these outputs as hypotheses rather than final answers.
- The framework extends current tool-calling capabilities, such as web search or code execution, into the domain of visual counterfactual reasoning.
- By visualizing events beyond direct experience, this mechanism supports reasoning that may be difficult to capture through language alone.
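
The points above can be made concrete with a minimal sketch of how a world-module call might sit inside a reasoning trace. Everything here is an illustrative assumption rather than the article's implementation: the names `Rollout`, `ReasoningTrace`, `stub_world_module`, and `reason_with_rollout` are hypothetical, and the stub returns text placeholders where a real world-module would return generated frames. The part that mirrors the summary is that the rollout enters the trace tagged as a hypothesis, not as an answer.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Rollout:
    """A short imagined scene, stored as a list of frame descriptions."""
    prompt: str
    frames: List[str]


@dataclass
class TraceStep:
    kind: str      # "thought" or "hypothesis"
    content: str


@dataclass
class ReasoningTrace:
    steps: List[TraceStep] = field(default_factory=list)

    def add_thought(self, text: str) -> None:
        self.steps.append(TraceStep("thought", text))

    def add_hypothesis(self, rollout: Rollout) -> None:
        # Rollouts are recorded as hypotheses to inspect, never as answers.
        summary = " -> ".join(rollout.frames)
        self.steps.append(TraceStep("hypothesis", f"{rollout.prompt}: {summary}"))


def stub_world_module(prompt: str, num_frames: int = 3) -> Rollout:
    """Hypothetical stand-in for a visual-temporal generator (assumption)."""
    frames = [f"frame {i} of '{prompt}'" for i in range(num_frames)]
    return Rollout(prompt=prompt, frames=frames)


def reason_with_rollout(
    question: str,
    scenario: str,
    world_module: Callable[[str], Rollout] = stub_world_module,
) -> ReasoningTrace:
    """Interleave text reasoning with one world-module call, like a tool call."""
    trace = ReasoningTrace()
    trace.add_thought(f"Question: {question}")
    trace.add_thought(f"Text alone may be insufficient; imagine: {scenario}")
    trace.add_hypothesis(world_module(scenario))
    trace.add_thought("Inspect the rollout, then accept, revise, or discard it.")
    return trace


if __name__ == "__main__":
    result = reason_with_rollout(
        "Does the ball reach the cup?",
        "a ball rolling down a ramp toward a cup",
    )
    for step in result.steps:
        print(f"[{step.kind}] {step.content}")
```

Passing the world-module in as a plain callable keeps it interchangeable with other tools such as search or code execution, which is the sense in which the framework extends tool calling.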