STAITUS: Disentangling Appearance and Pose for Video Object Tracking

The article introduces STAITUS, a unified framework for unsupervised video object tracking that addresses the limitations of existing slot-based representations by explicitly disentangling appearance from geometric pose. By applying temporal alignment only in appearance space and enforcing spatial separation within frames, the method prevents slots from locking onto static backgrounds during motion.
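
To make the appearance-only alignment concrete, here is a minimal sketch and not the authors' implementation. It assumes a slot is a flat vector whose first `APP_DIM` entries hold appearance and whose remaining entries hold pose; that layout, the cosine distance, and all function names are hypothetical choices for illustration.

```python
import math

APP_DIM = 4  # hypothetical layout: first 4 entries are appearance, the rest are pose


def split_slot(slot):
    """Split a flat slot vector into (appearance, pose) parts."""
    return slot[:APP_DIM], slot[APP_DIM:]


def cosine_distance(a, b):
    """1 - cosine similarity, with a small epsilon to avoid division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b + 1e-8)


def appearance_alignment_loss(slots_t, slots_t1):
    """Mean distance between matching slots, computed on appearance only."""
    total = 0.0
    for s_t, s_t1 in zip(slots_t, slots_t1):
        app_t, _ = split_slot(s_t)
        app_t1, _ = split_slot(s_t1)
        total += cosine_distance(app_t, app_t1)
    return total / len(slots_t)


def full_alignment_loss(slots_t, slots_t1):
    """Same distance on the whole slot vector, pose included (for contrast)."""
    total = sum(cosine_distance(a, b) for a, b in zip(slots_t, slots_t1))
    return total / len(slots_t)


# One object that keeps its appearance but moves across the frame.
frame_t = [[1.0, 0.0, 0.5, 0.2, 0.1, 0.1, 0.2]]   # pose: x=0.1, y=0.1, scale=0.2
frame_t1 = [[1.0, 0.0, 0.5, 0.2, 0.8, 0.7, 0.2]]  # pose: x=0.8, y=0.7, scale=0.2

print(appearance_alignment_loss(frame_t, frame_t1))  # close to zero
print(full_alignment_loss(frame_t, frame_t1))        # clearly positive
```

In this toy case the appearance-only loss leaves a moving object unpenalized, while a loss over the full slot vector charges it for changing position. The second behavior is the kind of pressure that would favor slots attached to static regions.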

STAITUS disentangles each slot into appearance and geometric pose (position and scale) to resolve the conflict between temporal consistency objectives and object motion.
The framework enforces within-frame spatial separation and applies temporal alignment exclusively in appearance space to improve mask sharpness and identity persistence.
An adaptive gating mechanism dynamically adjusts the number of active slots to the complexity of the scene, mitigating over-segmentation.
Extensive experiments on synthetic and real-world benchmarks show that STAITUS substantially outperforms state-of-the-art baselines in segmentation quality and tracking stability.
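
The spatial separation and gating ideas above can be sketched as follows. This is an illustration under stated assumptions, not the method's actual formulation: masks are flattened lists of per-pixel soft assignments, the separation term is a simple pairwise overlap penalty, and gating is a fixed threshold on a per-slot score. The function names and the `threshold` parameter are hypothetical.

```python
def overlap_penalty(masks):
    """Sum of pairwise per-pixel products of soft masks.

    The value is 0 when no two masks claim the same pixel and grows as
    masks overlap, so minimizing it pushes slots apart within a frame.
    """
    penalty = 0.0
    for i in range(len(masks)):
        for j in range(i + 1, len(masks)):
            penalty += sum(a * b for a, b in zip(masks[i], masks[j]))
    return penalty


def active_slots(gate_scores, threshold=0.5):
    """Indices of slots whose gate score exceeds the (assumed) threshold."""
    return [i for i, score in enumerate(gate_scores) if score > threshold]


# Two slots over a 4-pixel frame.
disjoint = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
overlapping = [[1.0, 1.0, 0.5, 0.0], [0.0, 0.5, 1.0, 1.0]]

print(overlap_penalty(disjoint))     # 0.0
print(overlap_penalty(overlapping))  # 1.0

# Four available slots, but only two are confidently in use.
print(active_slots([0.9, 0.1, 0.7, 0.2]))  # [0, 2]
```

With gating of this kind, a simple scene uses few slots instead of splitting one object across several, which is the over-segmentation the mechanism targets.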

The approach yields more accurate object-centric decomposition of dynamic scenes, keeping object identities persistent through motion, occlusion, and objects entering or leaving the frame.