The authors introduce reference-free measures for evaluating the physical consistency of generated videos by combining relative and absolute fidelity assessments. This approach addresses the gap in physical fidelity that often prevents video generation tools like WorldGym or WorldEval from accurately reproducing real-world task success rates for VLA models. Unlike existing methods requiring costly human voting or unavailable ground-truth references, the new framework utilizes DROID-SLAM and SEA-RAFT to quantify inconsistencies. Motivated by WorldScore, the relative consistency assessment filters videos to improve task success rates by over 8%. Additionally, the absolute assessment enables spatio-temporal localization to visualize when and where physical artifacts occur in the generated content.
Reference-Free Assessment of Physical Consistency in World Model-based Video Generation
from English