Distilling Transformers into Recurrent Transformers for Efficient Memory

A new distillation method transfers the observation compression strategy of full-history transformers to recurrent models. By training a teacher model to compress observation histories into fixed-size bottlenecks, the approach aligns the student's memory with the teacher's compression. This enables recurrent transformers to achieve near-full-history performance with linear-time complexity, making them viable for long-horizon robotics applications.