Researchers introduce Task-State Representation (TSR), a training-free framework designed to address the context burden faced by long-horizon mobile GUI agents. TSR explicitly decouples persistent task states from transient screen observations, preventing issues like forgetting initial requirements or hallucinating progress.
- The framework acts as a lightweight external wrapper maintaining three components: a global instruction summary, a dynamic progress tracker for subgoals, and a transition-aware action verifier.
- It continuously updates through pre- and post-action visual comparisons to guide agent reasoning without requiring architectural modifications.
- Experiments across four mobile GUI benchmarks show that TSR improves success rate by up to 12 absolute points on complex cross-application and memory-intensive tasks.
By managing task state separately from screen observations, TSR keeps the agent's reasoning tied to the original requirements and to the progress actually made, and the benchmark gains support this design on challenging mobile GUI tasks.
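
To make the three components concrete, here is a minimal Python sketch of what such an external wrapper could look like. It is an illustration under stated assumptions, not the authors' implementation: the names `TaskState`, `render`, and `apply_transition` are hypothetical, screens are stood in for by plain strings, and the "verifier" is a simple inequality check where the summary describes pre- and post-action visual comparisons.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class TaskState:
    """Persistent task state, stored apart from per-step screen observations.

    All field names are hypothetical; the summary does not specify a schema.
    """
    instruction_summary: str                          # global instruction summary
    subgoals: List[str]                               # ordered subgoals to track
    completed: Set[int] = field(default_factory=set)  # indices of finished subgoals

    def next_index(self) -> Optional[int]:
        """Return the index of the first unfinished subgoal, or None if all are done."""
        for i in range(len(self.subgoals)):
            if i not in self.completed:
                return i
        return None

    def next_subgoal(self) -> Optional[str]:
        """Return the text of the first unfinished subgoal, or None if all are done."""
        i = self.next_index()
        return None if i is None else self.subgoals[i]

    def render(self, screen: str) -> str:
        """Build the context handed to the agent: task state first, screen last."""
        lines = ["Task: " + self.instruction_summary, "Progress:"]
        for i, goal in enumerate(self.subgoals):
            mark = "x" if i in self.completed else " "
            lines.append("  [" + mark + "] " + goal)
        lines.append("Current screen: " + screen)
        return "\n".join(lines)


def apply_transition(state: TaskState, before: str, after: str,
                     subgoal_done: bool) -> bool:
    """Toy transition-aware check, run after each action.

    Assumption: an action that leaves the screen unchanged had no effect, so
    progress is not advanced. A real system would compare screenshots instead
    of strings. Returns True if the transition was accepted.
    """
    if before == after:
        return False
    if subgoal_done:
        i = state.next_index()
        if i is not None:
            state.completed.add(i)
    return True
```

A short usage example: the state is created once from the instruction, and only the screen string changes from step to step.

```python
state = TaskState(
    "Book a flight and email the receipt",
    ["open airline app", "book flight", "email receipt"],
)
apply_transition(state, "home", "home", True)         # rejected: screen did not change
apply_transition(state, "home", "airline app", True)  # accepted: first subgoal marked done
print(state.render("airline app"))
```

Keeping the state in a separate object means the agent's prompt can be rebuilt from a compact summary at every step instead of from the full interaction history, which is the decoupling the summary describes.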