Autonomous Video Generation with Counterfactual Controllability for Self-Evolving World Models
The article argues that current video generation models learn only partial, implicit spatiotemporal world models rather than fully grounded or controllable ones. It asserts that predictive realism alone is insufficient for building physical agents, because such models often fail to identify which variables an agent can actually control and which constraints its embodiment imposes.