Researchers propose DASH (Drift Aware advantage SHaping), a method that assigns segment-level credit to reasoning traces based on whether each step moves toward or away from the correct answer. By using intermediate answer commitments as a proxy for productivity, the approach identifies where self-reflection helps versus hurts without requiring costly step-level annotations.

  • DASH compares each candidate answer committed to within a trace against the ground truth to determine whether the reflection that follows it is productive.
  • On competition-level math benchmarks, DASH achieves 50.8% accuracy on AIME25, outperforming the GRPO baseline of 45.4%.
  • The method reduces overthinking behaviors such as hedging and self-contradiction while enabling more productive self-correction.

The approach targets extended chains of unproductive reasoning that consume tokens without improving answers, even when controlling for response length.
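
To make the segment-level credit idea concrete, here is a minimal Python sketch. It is not the authors' implementation: the summary does not give DASH's actual formula, so the labeling rule, the additive shaping, and every name here (`segment_drift`, `shape_advantages`, `bonus`) are assumptions chosen for illustration. Each segment is represented only by the answer it commits to, or `None` if it commits to nothing. A segment is labeled +1 when it moves the trace from a wrong (or missing) answer to the correct one, -1 when it moves away from a correct answer, and 0 otherwise.

```python
from typing import List, Optional


def segment_drift(commitments: List[Optional[str]], ground_truth: str) -> List[int]:
    """Label each segment by its drift relative to the ground truth.

    Hypothetical rule (an assumption, not taken from the source):
      +1  the segment newly commits to the correct answer
      -1  the segment abandons a previously correct answer
       0  no commitment, or correctness is unchanged
    """
    labels: List[int] = []
    prev_correct = False  # no correct commitment before the trace starts
    for answer in commitments:
        if answer is None:
            # No commitment in this segment: neutral, state carries over.
            labels.append(0)
            continue
        correct = answer == ground_truth
        if correct and not prev_correct:
            labels.append(1)
        elif prev_correct and not correct:
            labels.append(-1)
        else:
            labels.append(0)
        prev_correct = correct
    return labels


def shape_advantages(
    trace_advantage: float, labels: List[int], bonus: float = 0.5
) -> List[float]:
    """Turn one trace-level advantage into per-segment advantages.

    Additive shaping with a fixed `bonus` is an illustrative assumption.
    """
    return [trace_advantage + bonus * label for label in labels]


# Example trace: wrong answer, no commitment, correct answer,
# correct answer restated, then a harmful "self-correction".
commitments = ["12", None, "15", "15", "12"]
labels = segment_drift(commitments, ground_truth="15")
advantages = shape_advantages(1.0, labels)
```

In this example the third segment (productive reflection) receives extra credit and the final segment (drifting away from a correct answer) is penalized, while the remaining segments keep the unmodified trace-level advantage.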