This study conducts a factorised analysis of probe-based uncertainty estimation to determine what drives performance when detecting hallucinations in Large Language Models. It isolates variables across feature design, training data, and evaluation settings so that the contribution of each can be assessed separately.

  • Raw hidden states and attention features perform best in-domain but degrade under distribution shift.
  • Structured and compressed features are more robust to distribution shift than raw signals.
  • Prompting strategy and label construction substantially affect probe behavior and performance.
  • The authors develop benchmark-based pretrained probes that transfer reasonably well to open-ended factual generation tasks.
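
As a rough illustration of what a probe-based estimator and an in-domain versus shifted evaluation might look like, the sketch below trains a logistic-regression probe on synthetic stand-ins for hidden-state features and scores it with AUROC on two test splits. Everything in it is an assumption made for illustration and is not taken from the study: the synthetic data, the helper names (`make_synthetic`, `train_linear_probe`, `probe_scores`, `auroc`), the choice of a linear probe and of AUROC as the metric, and the way shift is simulated (weakening the label signal in the features).

```python
import numpy as np


def make_synthetic(n, dim, signal, rng):
    """Hypothetical stand-in for hidden-state features.

    Label 1 marks a hallucinated answer; only feature 0 carries label signal.
    """
    labels = rng.integers(0, 2, size=n)
    feats = rng.normal(size=(n, dim))
    feats[:, 0] += signal * (2 * labels - 1)
    return feats, labels.astype(float)


def train_linear_probe(features, labels, lr=0.1, epochs=300, l2=1e-3):
    """Fit a logistic-regression probe with plain gradient descent."""
    n, dim = features.shape
    w, b = np.zeros(dim), 0.0
    for _ in range(epochs):
        logits = np.clip(features @ w + b, -30.0, 30.0)
        err = 1.0 / (1.0 + np.exp(-logits)) - labels
        w -= lr * (features.T @ err / n + l2 * w)
        b -= lr * err.mean()
    return w, b


def probe_scores(features, w, b):
    """Probe output: estimated probability that an answer is hallucinated."""
    logits = np.clip(features @ w + b, -30.0, 30.0)
    return 1.0 / (1.0 + np.exp(-logits))


def auroc(labels, scores):
    """Rank-based AUROC (Mann-Whitney U); assumes no tied scores."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos = pos.sum()
    n_neg = len(labels) - n_pos
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


rng = np.random.default_rng(0)
train_x, train_y = make_synthetic(500, 16, signal=1.0, rng=rng)
test_x, test_y = make_synthetic(500, 16, signal=1.0, rng=rng)    # same distribution
shift_x, shift_y = make_synthetic(500, 16, signal=0.3, rng=rng)  # simulated shift

w, b = train_linear_probe(train_x, train_y)
in_domain_auc = auroc(test_y, probe_scores(test_x, w, b))
shifted_auc = auroc(shift_y, probe_scores(shift_x, w, b))
print(f"in-domain AUROC: {in_domain_auc:.3f}, shifted AUROC: {shifted_auc:.3f}")
```

Reporting both numbers side by side, rather than the in-domain score alone, is the kind of comparison the bullets above refer to. With real model activations, the synthetic generator would be replaced by extracted features and correctness labels.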

The authors provide a stable off-the-shelf baseline for uncertainty estimation and encourage the community to adopt more deployment-oriented evaluation methods for these estimators.