Systematic evaluation shows LLMs make data referencing errors across all tested model sizes

A new study presents the first systematic evaluation of tabular data referencing errors (DREs), in which a model cites table values incorrectly or omits them even though it understands the table's structure. The errors occur in every model tested, from 1.7B to 20B parameters.
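To make the error type concrete, here is a minimal sketch of a rule-based check for one kind of DRE: a reasoning step that cites a value different from the one in the table. The table layout, the `Reference` tuple, and the function `find_referencing_errors` are hypothetical illustrations, not the study's method, which relies on a trained critic model. Omitted values are not covered by this toy check.

```python
from typing import Dict, List, Tuple

# Hypothetical toy table: row label -> {column name -> cell value}.
Table = Dict[str, Dict[str, float]]
# A reference extracted from a reasoning step: (row, column, claimed value).
Reference = Tuple[str, str, float]


def find_referencing_errors(
    table: Table, refs: List[Reference], tol: float = 1e-9
) -> List[Reference]:
    """Return the references whose claimed value does not match the table."""
    errors = []
    for row, col, claimed in refs:
        actual = table.get(row, {}).get(col)
        # A missing cell or a mismatched value both count as an error here.
        if actual is None or abs(actual - claimed) > tol:
            errors.append((row, col, claimed))
    return errors


table = {
    "2021": {"revenue": 120.0, "cost": 80.0},
    "2022": {"revenue": 150.0, "cost": 95.0},
}
# The second reference reads the 2021 cost while claiming to cite 2022.
refs = [("2021", "revenue", 120.0), ("2022", "cost", 80.0)]
errors = find_referencing_errors(table, refs)
print(errors)
```

In this example the model has the right row and column in mind but copies the wrong cell, which is the pattern the summary describes: the structure is understood, the value is not.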

DREs compromise the correctness and reliability of intermediate reasoning steps in large language models.
Using a data-referencing check as a critic, applied through filtering and rejection sampling, improves answer accuracy by up to 12.0%.
A lightweight 4B-parameter critic model was trained, achieving an average F1 score of 78.2% for detecting both in-distribution and out-of-distribution DREs.

The authors demonstrate that using a dedicated critic model effectively assists inference for larger models by identifying these specific referencing failures.
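
The two ways of applying a critic mentioned above, filtering and rejection sampling, can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: `generate` and `critic_accepts` are hypothetical callables standing in for the larger model and the critic, and `max_tries` is an arbitrary retry budget.

```python
from typing import Callable, List, Optional


def filter_candidates(
    candidates: List[str], critic_accepts: Callable[[str], bool]
) -> List[str]:
    """Filtering: keep only the candidates the critic does not flag."""
    return [c for c in candidates if critic_accepts(c)]


def rejection_sample(
    generate: Callable[[], str],
    critic_accepts: Callable[[str], bool],
    max_tries: int = 4,
) -> Optional[str]:
    """Rejection sampling: resample until the critic accepts or tries run out."""
    for _ in range(max_tries):
        candidate = generate()
        if critic_accepts(candidate):
            return candidate
    return None  # No accepted candidate within the budget.


# Deterministic stand-ins so the sketch runs without any model.
_samples = iter(["cost in 2022 was 80", "cost in 2022 was 95"])


def fake_generate() -> str:
    return next(_samples)


def fake_critic(text: str) -> bool:
    # Pretend the table says the 2022 cost is 95.
    return "95" in text


result = rejection_sample(fake_generate, fake_critic)
kept = filter_candidates(["a: 95", "b: 80"], fake_critic)
print(result, kept)
```

Filtering discards flagged outputs from a fixed pool, while rejection sampling spends extra generation calls to replace them, so the choice trades compute against coverage.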