The article argues that current video generation models learn only partial, implicit spatiotemporal world models rather than fully grounded or controllable ones. It asserts that predictive realism alone is insufficient for creating physical agents because these models often fail to identify controllable variables and embodiment constraints.
- Existing literature claims video generation essentially constitutes world modelling, pushing AI toward temporally extended physical scenes.
- The authors contend that scaling visual prediction does not automatically yield physical agents capable of understanding controllability.
- The proposed solution treats counterfactual controllability as the decisive criterion for a self-evolving generative world model.
- This approach involves testing whether generated futures remain feasible under embodiment constraints and feeding the resulting action knowledge back into subsequent imagination.
The authors consider this perspective important because it provides a pathway to realize self-evolving world models through autonomous video generation with counterfactual controllability.
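
The test-and-feedback loop described in the bullets can be illustrated with a toy sketch. This is not the authors' method. It assumes a hypothetical one-dimensional "imaginer" in place of a video generator, a single embodiment constraint (a maximum move per frame), and a simple shrinking rule as the feedback step. All names (`ToyImaginer`, `survives_embodiment`, `self_evolve`) and the factor 0.9 are invented for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class ToyImaginer:
    """Hypothetical stand-in for a video generator: proposes 1-D position trajectories."""
    believed_max_step: float  # the imaginer's current guess at the per-frame motion limit

    def imagine(self, start, horizon, rng):
        # Sample a future whose per-frame moves respect only the *believed* limit.
        traj = [start]
        for _ in range(horizon):
            step = rng.uniform(-self.believed_max_step, self.believed_max_step)
            traj.append(traj[-1] + step)
        return traj


def largest_step(traj):
    """Largest frame-to-frame displacement in a trajectory."""
    return max(abs(b - a) for a, b in zip(traj, traj[1:]))


def survives_embodiment(traj, true_max_step):
    """A future is feasible only if every move is within the body's real limit."""
    return largest_step(traj) <= true_max_step


def self_evolve(imaginer, true_max_step, rounds=200, horizon=5, seed=0):
    """Imagine, test against the embodiment constraint, and feed rejections back."""
    rng = random.Random(seed)
    feasible_flags = []
    for _ in range(rounds):
        traj = imaginer.imagine(0.0, horizon, rng)
        ok = survives_embodiment(traj, true_max_step)
        feasible_flags.append(ok)
        if not ok:
            # Assumed feedback rule: a rejected future shows the belief was too
            # loose, so tighten it to just below the offending step size.
            imaginer.believed_max_step = min(
                imaginer.believed_max_step, 0.9 * largest_step(traj)
            )
    return feasible_flags


if __name__ == "__main__":
    imaginer = ToyImaginer(believed_max_step=5.0)
    flags = self_evolve(imaginer, true_max_step=1.0)
    print("feasible in first 20 rounds:", sum(flags[:20]))
    print("feasible in last 20 rounds:", sum(flags[-20:]))
    print("final believed limit:", imaginer.believed_max_step)
```

In this sketch the imaginer starts with an overly permissive belief about what the body can do, and each infeasible future tightens that belief, so later imagined futures are more likely to pass the embodiment check. A real system would replace the scalar limit with learned action and dynamics knowledge; the sketch only shows the shape of the loop.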