Researchers introduce the Planning Experience Exploration and Utilization (PEEU) method to enhance task planning in multimodal web agents built on small open-source Multimodal Large Language Models (MLLMs). With this approach, the agent explores environments autonomously to collect experiences, then uses those experiences in hindsight to synthesize high-level training data.
- PEEU enables small MLLMs to overcome weak planning and limited cross-website generalization by leveraging autonomous exploration and hindsight synthesis.
- The Task Decomposition Hierarchical Analysis Framework (TDHAF) is proposed to study compositional generalization across low, middle, and high task granularities.
- Analysis reveals that mastering low-level atomic skills does not guarantee high-level planning competence, whereas high-level task training yields stronger out-of-distribution (OOD) generalization.
- A 7B model using PEEU achieves 30.6% accuracy on real-world benchmarks, outperforming the larger Qwen2.5-VL-32B model.
These findings demonstrate that constructing high-level tasks and leveraging experiences are both crucial for improving the OOD planning abilities of small MLLMs in GUI agent applications.
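
To make the idea of hindsight experience utilization more concrete, the following is a minimal sketch of hindsight relabeling in general, not the authors' implementation. It assumes a trajectory collected during exploration is a list of low-level GUI steps, and it treats whatever the trajectory achieved as the task that was being solved. The names `Step`, `PlanningExample`, `describe`, and `hindsight_relabel` are hypothetical, and the string-joining summarizer is a stand-in for what would presumably be a model-generated task description.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One low-level GUI action recorded during exploration (hypothetical schema)."""
    action: str      # e.g. "click" or "type"
    target: str      # e.g. "search box"
    value: str = ""  # text entered, if any


@dataclass
class PlanningExample:
    """A synthesized high-level training example: a task and its step-by-step plan."""
    task: str
    plan: List[str]


def describe(step: Step) -> str:
    """Render a step as a short natural-language plan item."""
    if step.action == "type":
        return f"type '{step.value}' into {step.target}"
    return f"{step.action} {step.target}"


def hindsight_relabel(trajectory: List[Step]) -> PlanningExample:
    """Treat the outcome of an exploration trajectory as its goal, in hindsight.

    Assumption: the task text here is built by joining step descriptions.
    This is only a placeholder for a learned or model-based summarizer.
    """
    if not trajectory:
        raise ValueError("cannot relabel an empty trajectory")
    plan = [describe(step) for step in trajectory]
    task = "On this website, " + ", then ".join(plan) + "."
    return PlanningExample(task=task, plan=plan)


# An illustrative trajectory (made up for this sketch, not taken from any dataset).
trajectory = [
    Step("click", "search box"),
    Step("type", "search box", "wireless mouse"),
    Step("click", "first result"),
]
example = hindsight_relabel(trajectory)
```

Each relabeled trajectory yields a (task, plan) pair at the level of a whole task rather than a single atomic action, which is the kind of high-level supervision the summary credits with stronger OOD generalization.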