This paper introduces RE4, an imitation learning framework that combines principled manipulation theory with modern benchmarks to preserve both performance and interpretability in object interaction tasks. The approach uses lightweight, self-supervised pose estimation and mode-aware transformations to retrieve demonstrations and replan from them.

  • Proposes lightweight training for model-free pose estimation of target objects using self-supervision over demonstration data.
  • Implements manipulation-mode-aware retrieval of demonstrations to inform the learning process.
  • Applies a mode-aware transformation to the retrieved demonstration, followed by a replan step that connects to the retrieval point while preserving mode constraints.
  • Evaluates the framework on state-based and image-based benchmarks in Push-T and Robomimic, including an adversarial benchmark for sparse data regions.
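
The retrieve, transform, and replan steps listed above can be illustrated with a minimal sketch. This is not the paper's implementation: the summary does not specify the pose estimator, the definition of a manipulation mode, or the planner. The sketch assumes planar SE(2) object poses given as (x, y, theta), one discrete mode label per demonstration, nearest-pose retrieval within a mode, and straight-line interpolation as a stand-in for the replan step. All function names, the dictionary layout, and the `angle_weight` parameter are hypothetical.

```python
import math

def se2_compose(a, b):
    """Compose planar poses (x, y, theta): express b, given in a's frame, in the world frame."""
    ax, ay, at = a
    bx, by, bt = b
    c, s = math.cos(at), math.sin(at)
    return (ax + c * bx - s * by, ay + s * bx + c * by, at + bt)

def se2_inverse(p):
    """Inverse of a planar pose, so that se2_compose(p, se2_inverse(p)) is the identity."""
    x, y, t = p
    c, s = math.cos(t), math.sin(t)
    return (-(c * x + s * y), s * x - c * y, -t)

def pose_distance(p, q, angle_weight=0.1):
    """Assumed retrieval metric: Euclidean distance plus a weighted, wrapped angle difference."""
    dtheta = math.atan2(math.sin(p[2] - q[2]), math.cos(p[2] - q[2]))
    return math.hypot(p[0] - q[0], p[1] - q[1]) + angle_weight * abs(dtheta)

def retrieve(demos, obj_pose, mode):
    """Mode-aware retrieval: only demonstrations with a matching mode label are candidates."""
    candidates = [d for d in demos if d["mode"] == mode]
    if not candidates:
        raise ValueError(f"no demonstration with mode {mode!r}")
    return min(candidates, key=lambda d: pose_distance(d["object_pose"], obj_pose))

def transform_path(demo, obj_pose):
    """Move the demonstrated end-effector path rigidly with the object.

    The path keeps the same geometry relative to the object, which is the
    sense in which this sketch treats the mode as preserved (an assumption).
    """
    T = se2_compose(obj_pose, se2_inverse(demo["object_pose"]))
    return [se2_compose(T, (x, y, 0.0))[:2] for x, y in demo["ee_path"]]

def replan(ee_pos, path, n_steps=5):
    """Connect the current end-effector position to the start of the path by linear interpolation."""
    x0, y0 = ee_pos
    x1, y1 = path[0]
    connector = [
        (x0 + (x1 - x0) * i / n_steps, y0 + (y1 - y0) * i / n_steps)
        for i in range(n_steps)
    ]
    return connector + list(path)

# Hypothetical data: the "grasp" demonstration is closer in pose,
# but the mode filter excludes it when a "push" is requested.
demos = [
    {"mode": "push", "object_pose": (0.0, 0.0, 0.0),
     "ee_path": [(-0.1, 0.0), (0.0, 0.0)]},
    {"mode": "grasp", "object_pose": (1.0, 1.0, math.pi / 2),
     "ee_path": [(1.0, 1.2), (1.0, 1.0)]},
]
current_pose = (1.0, 1.0, math.pi / 2)  # would come from the pose estimator
demo = retrieve(demos, current_pose, "push")
path = transform_path(demo, current_pose)
plan = replan((0.0, 0.0), path, n_steps=2)
```

Filtering by mode before ranking by pose distance is one simple way to make retrieval mode-aware; a learned or contact-based mode classifier could replace the label lookup without changing the rest of the sketch.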

The work demonstrates the promise of using simple interpretable building blocks to learn manipulation skills, showing robustness in low-data regimes.