The article introduces a behavioral evaluation framework to calibrate test-time training (TTT) memory claims against actual deployment capabilities like personalization and recall. It argues that standard proxy metrics such as perplexity are insufficient evidence for these advanced behaviors, which require direct behavioral validation.
- The framework includes a claim-calibrated evidence ladder distinguishing stream adaptation from deployment-time behavioral learning.
- It uses an evaluation protocol with explicit-memory baselines and mutually exclusive failure categories.
- A controlled diagnostic on Qwen3 models validates the framework: one-step LoRA updates lower both support loss and answer loss, yet free-form recall remains at zero.
This approach provides authors and evaluators with a concrete standard for aligning TTT memory claims with the evidence actually reported.
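
To make the idea of mutually exclusive failure categories concrete, here is a minimal Python sketch. The category names (`CORRECT`, `EMPTY`, `PARTIAL`, `WRONG`), the string-matching rules, and the helpers `classify` and `summarize` are illustrative assumptions, not the article's actual taxonomy or scoring code. The sketch only shows how a free-form generation can be assigned to exactly one bucket, so that recall is measured directly instead of being inferred from a drop in loss.

```python
from collections import Counter
from enum import Enum


class Outcome(Enum):
    # Hypothetical categories; the article's own taxonomy may differ.
    CORRECT = "correct"   # target string appears in the generation
    EMPTY = "empty"       # model produced no text
    PARTIAL = "partial"   # some target tokens appear, but not the full target
    WRONG = "wrong"       # non-empty output with no overlap with the target


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so comparisons ignore formatting."""
    return " ".join(text.lower().split())


def classify(generated: str, target: str) -> Outcome:
    """Assign exactly one outcome to a free-form generation.

    The checks run in a fixed order and return on the first match,
    which is what makes the categories mutually exclusive.
    """
    g, t = normalize(generated), normalize(target)
    if not g:
        return Outcome.EMPTY
    if t in g:
        return Outcome.CORRECT
    if set(g.split()) & set(t.split()):
        return Outcome.PARTIAL
    return Outcome.WRONG


def summarize(pairs: list[tuple[str, str]]) -> dict[str, float]:
    """Return the fraction of (generation, target) pairs in each category."""
    counts = Counter(classify(g, t) for g, t in pairs)
    n = len(pairs)
    return {o.value: counts.get(o, 0) / n for o in Outcome}


if __name__ == "__main__":
    # Made-up strings for illustration only; not results from the article.
    examples = [
        ("The capital is Paris.", "Paris"),
        ("", "Paris"),
        ("new york", "new delhi"),
        ("london", "paris"),
    ]
    print(summarize(examples))
```

Because every generation receives a single label, the per-category rates sum to one, and the rate for the `correct` bucket can be reported next to any loss-based metric. A run in which loss falls while that rate stays at zero is the kind of gap the diagnostic above is meant to expose.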