The authors propose Erase-then-Delta Attention (EDA), a memory update rule for recurrent models that decouples the address used to erase stale information from the address used to write new content. It targets a limitation of delta-rule linear attention: the corrective write acts only at the current write address, so it cannot actively remove outdated information stored at other addresses.
- EDA applies a targeted erase step along a learned direction followed by a standard delta-style corrective write.
- Pretraining experiments on dense 2.5B and MoE 25B-A2.8B models show EDA outperforms existing methods in both settings.
- Gains persist after 80B-token long-context midtraining: EDA remains ahead in evaluations at context lengths from 4k to 128k.
- Analysis indicates EDA allocates an additional cleanup path most strongly when passive decay is weak.
The results suggest that recurrent memory models should independently decide what stale information to erase and where, rather than relying solely on corrective writes at the current address.
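
The erase-then-write idea summarized above can be illustrated with a small NumPy sketch. It is not the authors' formulation: the state layout `S` (values by keys), the rank-one erase `S (I - alpha e e^T)`, the scalar gates `alpha` and `beta`, and the function names are all assumptions made for illustration, and keys are assumed to be unit-norm. The toy example stores a stale value at one address and writes a new value at an orthogonal address, which is the case where a plain delta write leaves the stale entry untouched.

```python
import numpy as np

def delta_step(S, k, v, beta):
    """Delta-rule write: move the value read at address k toward v."""
    pred = S @ k                      # what the memory currently returns for k
    return S + beta * np.outer(v - pred, k)

def erase_then_delta_step(S, e, k, v, alpha, beta):
    """Hypothetical EDA-style step (an assumption, not the paper's exact rule):
    first erase along a separate direction e, then do a delta write at k."""
    S = S - alpha * np.outer(S @ e, e)   # equals S @ (I - alpha * e e^T)
    return delta_step(S, k, v, beta)

d_k = 4
e_old = np.eye(d_k)[0]                # address holding stale content
k_new = np.eye(d_k)[1]                # current write address, orthogonal to e_old
v_old = np.array([1.0, 2.0, 3.0])
v_new = np.array([-1.0, 0.0, 1.0])

S0 = np.outer(v_old, e_old)           # memory that stores v_old at e_old

S_delta = delta_step(S0, k_new, v_new, beta=1.0)
S_eda = erase_then_delta_step(S0, e_old, k_new, v_new, alpha=1.0, beta=1.0)

stale_after_delta = S_delta @ e_old   # still v_old: the write never touched e_old
stale_after_eda = S_eda @ e_old       # zeros: the erase step removed it
```

In this toy setting both updates store `v_new` at `k_new`, but only the erase-then-delta variant clears the stale value at `e_old`. In a trained model, the erase direction and gates would presumably be produced by learned projections of the input rather than fixed by hand.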