The article introduces MemDelta, a controlled evaluation protocol for agent memory systems that isolates individual components so that confounding variables do not skew results. Using the LongMemEval-S dataset (500 questions) across three model families, the study shows that reported gains often conflate changes in the memory method with changes in the language model or retrieval pipeline.
- Verbatim RAG (47.2%) is statistically similar to full-context GPT-4o-mini (49.8%), but the ranking reverses across models: Gemini gains +14pp from full context, while Sonnet gains +31pp from RAG, which the study attributes to refusal rates.
- Swapping only the embedding model in an otherwise identical pipeline shifts accuracy by +6.2pp at n = 500 (p = 0.004), showing that a single variable can flip a conclusion: Mem0 beats MiniLM-RAG by +11pp but loses to cloud RAG by 1.2pp.
- Agent self-memory reaches 42% accuracy, underperforming basic retrieval, which reaches 47%.
- On two specific question types (n = 88), Mem0 matches cloud RAG performance (72.7% vs. 73.9%) at 50 times the cost, indicating narrow rather than general gains.
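Comparisons like these are paired: both systems answer the same questions, so significance depends on the questions where they disagree. The summary does not say which test produced the reported p-value. As a minimal sketch, assuming per-question correctness flags are available, an exact McNemar test is one common choice. The function names and the toy data below are hypothetical and are not taken from the study.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar p-value for two systems scored on the same questions.

    correct_a / correct_b are equal-length sequences of booleans, one entry per
    question. Assumption: this is an illustrative test, not necessarily the one
    the study used.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both systems must be scored on the same questions")

    # Only discordant pairs carry information about a difference.
    a_only = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    b_only = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    n = a_only + b_only
    if n == 0:
        return 1.0

    # Under the null, discordant pairs split 50/50 (Binomial(n, 0.5)).
    k = min(a_only, b_only)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def delta_pp(correct_a, correct_b):
    """Accuracy of B minus accuracy of A, in percentage points."""
    return 100.0 * (sum(correct_b) - sum(correct_a)) / len(correct_a)


if __name__ == "__main__":
    # Hypothetical toy data: 20 questions, B fixes 10 that A missed.
    system_a = [True] * 10 + [False] * 10
    system_b = [True] * 20
    print(delta_pp(system_a, system_b), mcnemar_exact(system_a, system_b))
```

Reporting the discordant counts alongside the headline delta makes it easier to judge whether a gap such as 72.7% vs. 73.9% at n = 88 can be distinguished from noise.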
The authors recommend that memory evaluations fix embedding models across comparisons, stratify results by model family, and report write-path costs to accurately attribute performance gains to specific architectural changes.
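These recommendations can be sketched as a small reporting helper: it refuses to compare runs that used different embedding models, groups results by model family and system, and reports write-path cost next to accuracy. The `Result` record, its field names, and the cost unit are assumptions made for illustration; the study's actual tooling is not described here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    # Hypothetical per-question record; field names are illustrative only.
    system: str            # e.g. a memory method or a RAG baseline
    model_family: str      # answering model family
    embedding_model: str   # embedder used by the retrieval pipeline
    correct: bool
    write_cost_usd: float  # cost of writing this question's history to memory


def stratified_report(results):
    """Return accuracy and write-path cost per (model_family, system)."""
    # Fix the embedding model across comparisons: mixed embedders confound the result.
    embedders = {r.embedding_model for r in results}
    if len(embedders) > 1:
        raise ValueError(f"mixed embedding models in one comparison: {sorted(embedders)}")

    # Stratify by model family so cross-model ranking reversals stay visible.
    groups = defaultdict(list)
    for r in results:
        groups[(r.model_family, r.system)].append(r)

    report = {}
    for key in sorted(groups):
        rows = groups[key]
        report[key] = {
            "n": len(rows),
            "accuracy_pct": 100.0 * sum(r.correct for r in rows) / len(rows),
            "write_cost_usd": sum(r.write_cost_usd for r in rows),
        }
    return report
```

Raising an error on mixed embedders, rather than silently averaging, is a deliberate choice in this sketch: it forces each comparison to hold that variable constant before any gain is attributed to the memory architecture.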