The authors introduce AgenticSTS, a testbed for studying how explicit memory layers shape long-horizon LLM-agent decisions. It enforces a bounded-memory contract in the game Slay the Spire 2: prompts are assembled by typed retrieval rather than by appending raw transcripts.
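
To make the bounded-memory contract concrete, here is a minimal sketch of typed retrieval, not the authors' implementation. It assumes each memory entry carries a type tag and a relevance score, and that the prompt takes at most a fixed number of entries per type, so its size does not grow with the length of the game. The names MemoryEntry, assemble_prompt, and quotas, the type tags, and all example texts are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class MemoryEntry:
    kind: str     # hypothetical type tag, e.g. "skill" or "episode"
    text: str     # made-up example content
    score: float  # hypothetical relevance score for the current decision


def assemble_prompt(state: str, store: List[MemoryEntry], quotas: Dict[str, int]) -> str:
    """Build a prompt from the current state plus at most quotas[kind] entries per type."""
    sections = [f"[state]\n{state}"]
    for kind, limit in quotas.items():
        # Rank entries of this type by score and keep only the top `limit`.
        ranked = sorted(
            (e for e in store if e.kind == kind),
            key=lambda e: e.score,
            reverse=True,
        )
        chosen = ranked[:limit]
        if chosen:
            sections.append(f"[{kind}]\n" + "\n".join(f"- {e.text}" for e in chosen))
    return "\n\n".join(sections)


store = [
    MemoryEntry("skill", "Prioritize block when incoming damage exceeds current HP.", 0.9),
    MemoryEntry("skill", "Avoid adding cards that dilute the deck.", 0.4),
    MemoryEntry("episode", "Lost to an elite after skipping a rest site.", 0.7),
    MemoryEntry("episode", "Won a boss fight by saving potions.", 0.6),
    MemoryEntry("episode", "Took an early shop detour.", 0.1),
]
prompt = assemble_prompt("Floor 12, HP 31/70", store, {"skill": 1, "episode": 2})
print(prompt)
```

However many entries the store accumulates, the prompt here holds one skill and two episodes; a raw-transcript approach would instead grow with every turn played.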

  • In a fixed-A0 ablation, enabling strategic skills increased wins from 3/10 to 6/10 games.
  • A public online benchmark of frontier LLMs on Slay the Spire 2 reports zero wins across five configurations; frontier LLMs record no wins at the lowest difficulty, while the human win rate is 16%.
  • The release includes 298 completed trajectories with condition tags, frozen memory/skill snapshots, and analysis scripts.
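
As an illustration of how condition-tagged trajectories support an ablation like the one above, the sketch below tallies wins per condition. The record schema (a "condition" tag and a "won" flag) and the tag names are assumptions made for this example, not the release's actual format; the toy records only reproduce the 3/10 and 6/10 counts reported above.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def win_rates(trajectories: Iterable[dict]) -> Dict[str, Tuple[int, int, float]]:
    """Return (wins, games, win rate) for each condition tag."""
    tally = defaultdict(lambda: [0, 0])  # condition -> [wins, games]
    for t in trajectories:
        counts = tally[t["condition"]]
        counts[0] += int(t["won"])
        counts[1] += 1
    return {c: (w, n, w / n) for c, (w, n) in tally.items()}


# Toy records with hypothetical tags: 3 wins in 10 games, then 6 wins in 10 games.
records = (
    [{"condition": "skills_off", "won": i < 3} for i in range(10)]
    + [{"condition": "skills_on", "won": i < 6} for i in range(10)]
)
rates = win_rates(records)
for condition, (wins, games, rate) in rates.items():
    print(f"{condition}: {wins}/{games} = {rate:.0%}")
# skills_off: 3/10 = 30%
# skills_on: 6/10 = 60%
```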

This work provides a validated, reusable methodology for isolating the effects of specific memory components in agent design.