Researchers introduce MECoBench, a multimodal embodied cooperation benchmark designed to evaluate the collaborative capabilities of multimodal large language models (MLLMs) in visually grounded environments. The platform spans diverse real-world tasks and includes two cooperation structures alongside three distinct collaboration modes.
- Extensive experiments reveal that collaboration generally improves task completion, but the net benefit depends on whether its gains outweigh the added coordination complexity.
- Communication is identified as essential for collaboration success, with optimal modes varying based on team size and model capability.
- Experiments on the benchmark show that collaboration improves robustness under noisy priors and exploration conditions.
MECoBench provides a systematic testbed for understanding the mechanisms and limits of multimodal embodied collaboration.