Researchers have introduced QVal, a training-free testbed designed to directly evaluate the quality of dense supervision signals used in long-horizon LLM agents. Standard practice judges such signals by downstream performance after training, which conflates signal quality with training engineering. QVal instead assesses how well a method's scores align with the Q-values of a strong reference policy.
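To make the idea of "alignment with reference Q-values" concrete, here is a minimal sketch. The summary does not say which alignment statistic QVal uses, so this assumes Spearman rank correlation as one plausible choice. The function names (`spearman_alignment`, `average_ranks`, `pearson`) and the toy numbers are hypothetical illustrations, not taken from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman_alignment(scores, q_values):
    """Assumed alignment measure: Spearman correlation between a method's
    per-step scores and the reference policy's Q-values."""
    if len(scores) != len(q_values):
        raise ValueError("scores and q_values must have the same length")
    return pearson(average_ranks(scores), average_ranks(q_values))


# Toy data (illustrative only): reference Q-values for four agent steps.
q_reference = [0.9, 0.2, 0.5, 0.7]
method_a = [0.8, 0.1, 0.4, 0.6]   # preserves the reference ordering
method_b = [0.1, 0.9, 0.5, 0.3]   # reverses the reference ordering

alignment_a = spearman_alignment(method_a, q_reference)
alignment_b = spearman_alignment(method_b, q_reference)
```

Under this assumed metric, a method is scored purely on how it ranks states or actions relative to the reference policy, so no agent has to be trained to compare supervision signals.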
The authors instantiated QVal as QVal-v1.0, which benchmarks 21 dense supervision methods from seven methodological families across four diverse environments. The evaluation comprised over 1.2K experiments on six open-weight model backbones.
The study found that simple prompting baselines consistently outperform recent dense supervision methods from the literature, with performance clustering strongly by family. These findings hold across various model sizes, environments, and observation modalities.