This study investigates whether verbose chain-of-thought prompting improves large language model reasoning by adding computation or by supplying useful semantic content. The authors combine evidence from in-distribution sampling and controlled interventions to isolate the factors that drive the performance gains.

  • In-distribution analysis across 25 models showed that extra tokens left accuracy essentially unchanged when the traces followed the same reasoning plan.
  • Blind analysis of surplus tokens revealed that any gains tracked validation and checking content rather than verbosity itself.
  • Controlled interventions using dual-validator designs found that verbose traces improved accuracy modestly (typically 1 to 4 points), with the size of the gain depending on prose quality.
  • Under maximum numerical redaction, the effect was amplified, with a median 3.24x increase across four arithmetic benchmarks.
  • Length-matched non-reasoning filler failed to recover any of the performance gains observed in verbose reasoning traces.

The findings converge on the conclusion that what matters is the reasoning and validation content carried by extra tokens, not merely their quantity.
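
The length-matched filler control mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the filler vocabulary, the whitespace-based token count, and the names `count_tokens` and `length_matched_filler` are all assumptions made for illustration. The idea is only that the control trace carries the same number of tokens as the verbose trace but no reasoning content.

```python
from itertools import cycle, islice

# Hypothetical filler vocabulary; the source does not say what filler was used.
FILLER_WORDS = ("let", "me", "think", "about", "this", "a", "bit", "more")


def count_tokens(text: str) -> int:
    """Count whitespace-separated tokens (a stand-in for a model tokenizer)."""
    return len(text.split())


def length_matched_filler(trace: str, filler_words=FILLER_WORDS) -> str:
    """Build non-reasoning filler with the same token count as `trace`."""
    n = count_tokens(trace)
    # Cycle through the filler vocabulary until the token budget is used up.
    return " ".join(islice(cycle(filler_words), n))


# Illustrative trace, not taken from the study.
verbose_trace = "First add 17 and 25 to get 42, then check: 42 - 25 = 17."
filler = length_matched_filler(verbose_trace)
```

Because the filler matches the verbose trace in length only, any accuracy gap between the two conditions can be attributed to content rather than token count. A real replication would presumably count tokens with the evaluated model's own tokenizer instead of whitespace splitting.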