BiPACE: Bisimulation-Guided Policy Optimization with Action Counterfactual Estimation for LLM Agents

The authors identify a fundamental state-action credit mismatch in stepwise group-based RL for long-horizon LLM agents. Current estimators partition states too finely and average over actions too coarsely, which together violate the equivalence assumptions that credit assignment relies on. BiPACE is introduced as a drop-in advantage estimator that addresses both problems without adding critics or extra rollouts. It clusters steps by cosine distance in the actor's hidden-state geometry to reduce singleton groups, and it recenters returns using action-conditioned peer baselines. On ALFWorld with Qwen2.5-7B, BiPACE_Q raises validation success from 90.8% to 97.1 ± 0.9%, crossing the 95% threshold on every seed. It also improves performance on Qwen2.5-1.5B and achieves gains over GRPO and GiGPO on WebShop and TextCraft. The method adds only 11.3% to the wall-clock time of a single training step, while shifting the unit of comparison to groups of approximately behaviorally equivalent steps.
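
The two ingredients named above, grouping steps by cosine distance between hidden states and recentering returns with action-conditioned peer baselines, can be illustrated with a minimal Python sketch. This is not the authors' estimator: the summary gives no formulas, so the greedy clustering rule, the `threshold` value, the function names, and the specific baseline (per-action mean return within a cluster minus the cluster-wide mean) are all assumptions chosen for illustration.

```python
import numpy as np

def cluster_by_cosine(hidden, threshold=0.1):
    """Greedy clustering (an assumed rule, not from the source): a step joins
    the first cluster whose seed is within `threshold` cosine distance,
    otherwise it opens a new cluster. Assumes nonzero hidden vectors."""
    hidden = np.asarray(hidden, dtype=float)
    unit = hidden / np.linalg.norm(hidden, axis=1, keepdims=True)
    seeds = []
    labels = np.empty(len(unit), dtype=int)
    for i, v in enumerate(unit):
        for c, seed in enumerate(seeds):
            if 1.0 - float(v @ seed) <= threshold:
                labels[i] = c
                break
        else:
            # No existing cluster is close enough: this step seeds a new one.
            seeds.append(v)
            labels[i] = len(seeds) - 1
    return labels

def action_conditioned_advantages(labels, actions, returns):
    """Hypothetical advantage: mean return of same-cluster peers that took the
    same action, minus the mean return of the whole cluster."""
    returns = np.asarray(returns, dtype=float)
    adv = np.zeros_like(returns)
    for c in np.unique(labels):
        in_cluster = labels == c
        cluster_mean = returns[in_cluster].mean()
        for a in {actions[i] for i in np.flatnonzero(in_cluster)}:
            same_action = in_cluster & np.array([x == a for x in actions])
            adv[same_action] = returns[same_action].mean() - cluster_mean
    return adv

# Toy data: four steps with 2-d stand-ins for actor hidden states.
hidden = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [1.0, 0.01]]
actions = ["open", "open", "look", "go"]
returns = [1.0, 1.0, 0.0, 0.0]

labels = cluster_by_cosine(hidden)
advantages = action_conditioned_advantages(labels, actions, returns)
```

In this toy data, the first, second, and fourth steps land in one cluster because their hidden vectors are nearly parallel, so the "open" and "go" actions are compared against each other rather than each step forming its own group. The third step remains a singleton and receives zero advantage, which is the degenerate case that coarser state grouping is meant to make rarer.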