The article introduces STAITUS, a unified framework for unsupervised video object tracking that addresses the limitations of existing slot-based methods by explicitly disentangling appearance from geometric pose. This approach resolves conflicts between temporal consistency and object motion, preventing slots from locking onto static backgrounds.

  • STAITUS enforces within-frame spatial separation and applies temporal alignment only in appearance space to maintain persistent identities under motion and occlusion.
  • An adaptive gating mechanism is introduced to dynamically adjust the number of active slots based on scene complexity, mitigating over-segmentation.
  • Extensive experiments on synthetic and real-world benchmarks show that STAITUS substantially outperforms state-of-the-art baselines in segmentation quality and tracking stability.
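
The adaptive gating idea in the list above can be illustrated with a minimal sketch. The summary does not describe how STAITUS computes or thresholds its gates, so everything here is an assumption made for illustration: the per-slot gate logits, the sigmoid, the 0.5 threshold, and the hypothetical helper name `active_slots`.

```python
import math


def active_slots(gate_logits, threshold=0.5):
    """Return the indices of slots whose gate value exceeds the threshold.

    Assumption: each slot carries one scalar gate logit, squashed to (0, 1)
    by a sigmoid. The actual STAITUS gating rule may differ.
    """
    gates = [1.0 / (1.0 + math.exp(-logit)) for logit in gate_logits]
    return [i for i, gate in enumerate(gates) if gate > threshold]


# Made-up logits for five slots in two scenes of different complexity.
simple_scene = [3.0, 2.0, -4.0, -5.0, -3.5]
busy_scene = [3.0, 2.5, 1.8, 0.9, -2.0]

print(active_slots(simple_scene))  # [0, 1]
print(active_slots(busy_scene))    # [0, 1, 2, 3]
```

With the same pool of five slots, the simple scene keeps two slots active and the busy scene keeps four, which is the kind of behaviour that would limit over-segmentation when few objects are present.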

By decoupling appearance and pose, the framework yields sharper masks and more stable tracking for foreground objects, even during entry, exit, or occlusion events.
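
To make the decoupling concrete, the sketch below computes a temporal alignment term from appearance vectors alone, so that a change in pose between frames adds no penalty. The slot layout (an `(appearance, pose)` pair), the cosine distance, matching slots by index, and the hypothetical function names are all assumptions for illustration; the summary does not specify the loss STAITUS uses.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def temporal_alignment_loss(slots_prev, slots_next):
    """Mean appearance distance between same-index slots in two frames.

    Each slot is an (appearance, pose) pair. The pose component is ignored
    on purpose, so object motion alone does not increase the loss.
    """
    distances = [
        cosine_distance(app_prev, app_next)
        for (app_prev, _pose_prev), (app_next, _pose_next) in zip(slots_prev, slots_next)
    ]
    return sum(distances) / len(distances)


# Two slots whose appearance is unchanged while their poses move a lot.
frame_t = [([1.0, 0.0], (0.1, 0.1)), ([0.0, 1.0], (0.8, 0.5))]
frame_t1 = [([1.0, 0.0], (0.6, 0.4)), ([0.0, 1.0], (0.2, 0.9))]

# Same poses as frame_t, but the two slots have swapped identities.
frame_swapped = [([0.0, 1.0], (0.1, 0.1)), ([1.0, 0.0], (0.8, 0.5))]

print(temporal_alignment_loss(frame_t, frame_t1))       # 0.0
print(temporal_alignment_loss(frame_t, frame_swapped))  # 1.0
```

In this toy setup, motion with a stable appearance costs nothing, while an identity swap at fixed positions is penalized. A consistency term applied to the full slot would instead penalize the motion itself.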