MemDelta: Controlled Baselines and Hidden Confounds in Agent Memory Evaluation

The article introduces MemDelta, a controlled evaluation protocol for agent memory systems that varies one component at a time so that confounding variables cannot skew results. Using the LongMemEval-S dataset with 500 questions across three model families, the study shows that reported gains often conflate changes in the memory method with changes in the language model or the retrieval pipeline.

Verbatim RAG (47.2%) is statistically similar to full-context GPT-4o-mini (49.8%), but the ranking reverses across models: Gemini gains +14pp from full context, while Sonnet gains +31pp from RAG due to refusal rates.
Swapping only the embedding model in an otherwise identical pipeline shifts accuracy by +6.2pp at n = 500 (p = 0.004), showing that a single variable can flip a conclusion: Mem0 beats MiniLM-RAG by 11pp but loses to cloud-RAG by 1.2pp.
Agent self-memory achieves 42% accuracy, underperforming basic retrieval, which reaches 47%.
On two specific question types (n = 88), Mem0 matches cloud-RAG performance (72.7% vs. 73.9%) at 50 times the cost, indicating narrow rather than general gains.
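
A comparison such as the embedding swap above is paired: both pipelines answer the same questions, so significance depends only on the questions where exactly one pipeline is correct. The summary does not say which test produced p = 0.004; the sketch below assumes an exact McNemar test as one reasonable choice. The function names and the toy inputs are hypothetical and are not data from the study.

```python
from math import comb


def delta_pp(correct_a, correct_b):
    """Accuracy of system B minus system A, in percentage points."""
    return 100.0 * (sum(correct_b) - sum(correct_a)) / len(correct_a)


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar p-value on paired per-question correctness.

    Assumption: this is an illustrative test, not necessarily the one
    used in the study.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("paired inputs must have equal length")
    # Discordant pairs: questions where exactly one system is right.
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    n = only_a + only_b
    if n == 0:
        return 1.0  # the systems never disagree
    k = min(only_a, only_b)
    # Under H0 each discordant pair is a fair coin flip: Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy paired results (synthetic, for illustration only).
baseline = [True, False, False, True, False, True]
swapped = [True, True, True, True, False, True]
shift = delta_pp(baseline, swapped)
p_value = mcnemar_exact(baseline, swapped)
```

Because the test ignores questions that both systems get right or both get wrong, it needs per-question correctness for each pipeline, not just the two headline accuracies.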

The authors recommend that memory evaluations fix the embedding model across comparisons, stratify results by model family, and report write-path costs, so that performance gains can be attributed to specific architectural changes.
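
These three recommendations can be enforced mechanically when results are aggregated. The following is a minimal sketch, assuming each evaluation record is a dictionary with the hypothetical fields `model_family`, `method`, `embedding`, `correct`, and `write_cost`; none of these names come from the article.

```python
from collections import defaultdict


def stratified_report(records):
    """Summarize results per (model_family, method).

    Raises ValueError if the retrieval-based records mix embedding models.
    Records with embedding=None (e.g. full-context runs) are exempt.
    """
    embeddings = {r["embedding"] for r in records if r["embedding"] is not None}
    if len(embeddings) > 1:
        raise ValueError(f"embedding model not fixed: {sorted(embeddings)}")

    hits = defaultdict(int)
    totals = defaultdict(int)
    cost = defaultdict(float)
    for r in records:
        key = (r["model_family"], r["method"])
        totals[key] += 1
        hits[key] += int(r["correct"])
        # Write-path cost: what it cost to build or update memory for this item.
        cost[key] += r.get("write_cost", 0.0)

    return {
        key: {
            "accuracy_pct": 100.0 * hits[key] / totals[key],
            "n": totals[key],
            "write_cost": cost[key],
        }
        for key in totals
    }
```

Reporting `n` alongside each stratum makes small subsets visible, which matters when a gain holds only on a narrow slice of question types.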