The authors introduce MedEvoEval, an executable longitudinal evaluation framework for assessing the continual evolution of doctor agents through simulated outpatient clinical episodes. The framework moves beyond static benchmarks by tracking how agents acquire evidence, use resources, and refine their decision-making across multiple interactions.
- The framework converts source cases into role-specific patient, examination, and manager views, revealing evidence only through valid actions.
- Each episode generates a structured trace linking observations, actions, final outputs, manager scores, and optional experience write-back.
- The released runnable artifact contains 700 processed episodes, provenance notes, schemas, an episode runner, scoring scripts, and analysis code.
- Experiments demonstrate that episode traces reveal process costs that final-answer scoring hides, and show how multidisciplinary team (MDT) style consultation reallocates resources.
- The framework supports longitudinal analyses of memory maturation, held-out transfer, update-stage response, and backward retention.
MedEvoEval provides a concrete basis for evaluating whether doctor agents improve through experience, transfer useful behavior, and retain earlier capabilities over time.
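
The evidence gating and trace structure described in the list above can be illustrated with a minimal sketch. It is not the released episode runner: the class names (`Case`, `Episode`), the action names (`ask`, `order_exam`), the trace fields, and the exact-match scoring rule are all hypothetical, chosen only to show how role-specific views might be revealed through valid actions while every step is logged.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Case:
    """Hypothetical role-specific split of one source case."""
    patient_view: Dict[str, str]   # what the simulated patient can report
    exam_view: Dict[str, str]      # examination results, keyed by exam name
    manager_view: Dict[str, str]   # reference answer, never shown to the agent


@dataclass
class Episode:
    case: Case
    trace: List[dict] = field(default_factory=list)

    def step(self, action: str, target: str) -> Optional[str]:
        """Reveal evidence only for a valid action; log every attempt."""
        # Assumed action set: the manager view is deliberately unreachable.
        views = {"ask": self.case.patient_view,
                 "order_exam": self.case.exam_view}
        view = views.get(action)
        observation = view.get(target) if view is not None else None
        self.trace.append(
            {"action": action, "target": target, "observation": observation}
        )
        return observation

    def finish(self, diagnosis: str) -> dict:
        """Close the episode and attach a manager score to the trace."""
        # Assumption: exact-match scoring stands in for the real manager.
        correct = diagnosis == self.case.manager_view["diagnosis"]
        return {
            "trace": self.trace,
            "final_output": diagnosis,
            "manager_score": 1.0 if correct else 0.0,
            "n_actions": len(self.trace),  # a simple process-cost proxy
        }
```

Because invalid or wasted actions are logged alongside useful ones, two agents that reach the same final answer can still be distinguished by the length and content of their traces, which is the kind of process cost that final-answer scoring alone would not show.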