This article identifies a distinct failure mode in large language model agents: they struggle to discard outdated facts in favor of current ones, even when comprehension is intact. The authors demonstrate that this "supersession gap" persists across model scales and memory sizes, indicating it is a trainable bottleneck rather than a limitation of context window or model strength.

  • Replacing full context with bounded memory on LongMemEval drops accuracy from 92% to 77% for GPT-5.4, a statistically significant gap (p<0.005).
  • As conversation length increases 24x, accuracy falls further, from 68% to 28%, and granting proportionally more memory yields no recovery.
  • The authors release Supersede, an open reinforcement-learning environment that rewards agents for using current values and penalizes stale ones.
  • GRPO fine-tuning of Qwen2.5-3B on this environment nearly doubles held-out supersession accuracy, from 9.0% to 16.7%.
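
To make the reward idea concrete, here is a minimal Python sketch of a currency-aware reward and of the group-normalized advantage that GRPO computes over a batch of sampled answers. It is an illustration under stated assumptions, not the Supersede implementation: the function names, the +1 / -1 / 0 reward values, the substring matching, and the use of the population standard deviation are all hypothetical choices made for this example.

```python
from statistics import mean, pstdev


def supersession_reward(answer: str, current: str, stale: list[str]) -> float:
    """Hypothetical reward: +1 for the current value, -1 for a stale one.

    Assumption: case-insensitive substring matching stands in for whatever
    answer checking the real environment performs.
    """
    text = answer.lower()
    uses_current = current.lower() in text
    uses_stale = any(s.lower() in text for s in stale)
    if uses_current and not uses_stale:
        return 1.0   # answer relies only on the up-to-date fact
    if uses_stale and not uses_current:
        return -1.0  # answer relies only on a superseded fact
    return 0.0       # neither value, or an ambiguous mix of both


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / group std, so samples are
    scored relative to their siblings rather than on an absolute scale.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All samples scored the same: no learning signal from this group.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


if __name__ == "__main__":
    samples = [
        "You live in Berlin now.",
        "You live in Paris.",
        "I am not sure where you live.",
    ]
    rewards = [supersession_reward(s, "Berlin", ["Paris"]) for s in samples]
    print(rewards)
    print(group_relative_advantages(rewards))
```

Scoring an answer that mentions both values as 0 is a deliberate simplification: it avoids penalizing answers such as "you moved from Paris to Berlin", at the cost of giving them no credit.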

This work provides the first evidence that the memory-update gap can be trained down, rather than merely measured, using a reward signal that targets temporal fact-currency.