Studies show that misleading feedback can cause LLM agents to perform worse than they do with no feedback at all. On HotpotQA, Qwen2.5-7B drops from 44.8 to 4.7 F1 under shuffled retrieval, despite clean tools. These results indicate that reported tool gains may be overstated and that no-feedback controls are essential for valid evaluation.
Unreliable Feedback Can Harm Tool-Using LLM Agents