EnvRL introduces a framework that enhances agentic reinforcement learning by incorporating environment dynamics through state prediction and inverse dynamics objectives. When trained with GRPO, EnvRL improves success rates of Qwen-2.5-1.5B-Instruct from 72.8% to 77.4% on ALFWorld and from 56.8% to 67.0% on WebShop.
arxiv · arXiv cs.LG · 8d ago · src: 9d ago · research
EnvRL: Leveraging Environment Dynamics in Agentic RL
Importance 3/3
New feature vs. leaders
New harness with differentiators
Alibaba (Qwen)
AI agents
Reasoning models
Training methods
Benchmarks
| Benchmark | Model | Success rate |
|---|---|---|
| ALFWorld | Qwen-2.5-1.5B-Instruct (EnvRL + GRPO) | 77.4% |
| WebShop | Qwen-2.5-1.5B-Instruct (EnvRL + GRPO) | 67.0% |
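To put the reported numbers in perspective, the short sketch below works out the absolute and relative gains from the success rates quoted in the summary. The `results` dictionary and the `gains` helper are illustrative names, not taken from the paper, and the first figure of each pair is assumed to be the same model without EnvRL's objectives.

```python
# Success rates (%) quoted in the summary: (before, after adding EnvRL).
# Assumption: "before" is the same model without EnvRL's objectives.
results = {
    "ALFWorld": (72.8, 77.4),
    "WebShop": (56.8, 67.0),
}


def gains(before, after):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = round(after - before, 1)
    relative = round(100 * (after - before) / before, 1)
    return absolute, relative


for name, (before, after) in results.items():
    absolute, relative = gains(before, after)
    print(f"{name}: +{absolute} points ({relative}% relative)")
```

By this arithmetic the WebShop gain is roughly twice the ALFWorld gain in absolute terms.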