Systematic Benchmark of Lightweight Hallucination Detection Across QA, Dialogue, and Summarisation

This paper benchmarks five lightweight, CPU-feasible hallucination detection methods to provide practical alternatives for resource-constrained researchers who cannot use GPU-intensive or proprietary solutions. The study evaluates ROUGE-L, semantic similarity, BERTScore, a FEVER-trained DeBERTa NLI detector, and an ensemble of similarity and NLI across the HaluEval benchmark's question answering, dialogue, and summarisation tasks.
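The text above does not specify how the ensemble combines its two signals. As a minimal sketch, assuming a weighted average of the two "risk" scores, the combination could look like the following. The function name `ensemble_score`, the `weight` parameter, and the convention that both inputs lie in [0, 1] are illustrative assumptions, not the paper's stated method.

```python
def ensemble_score(similarity: float, entailment_prob: float, weight: float = 0.5) -> float:
    """Combine two signals into one hallucination score in [0, 1].

    Hypothetical combination rule (an assumption, not taken from the paper):
    a weighted average of (1 - similarity) and (1 - entailment probability).
    Higher output means the response is more likely to be hallucinated.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")

    # Clamp inputs so that slightly out-of-range values (for example, a
    # negative cosine similarity) do not push the score outside [0, 1].
    sim = max(0.0, min(1.0, similarity))
    ent = max(0.0, min(1.0, entailment_prob))

    sim_risk = 1.0 - sim   # low similarity to the source => higher risk
    nli_risk = 1.0 - ent   # low entailment probability   => higher risk
    return weight * sim_risk + (1.0 - weight) * nli_risk


if __name__ == "__main__":
    # A response that is fairly similar to the source but weakly entailed.
    print(ensemble_score(similarity=0.8, entailment_prob=0.4))
```

Under this assumed rule, `weight` would be a tunable hyperparameter; an equal weighting is used here only as a neutral default.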

No single method dominates; performance is highly task-dependent.
The ensemble achieves the best results on question answering with an F1 score of 0.792 and AUC-ROC of 0.873.
The NLI detector leads in dialogue detection with an AUC-ROC of 0.713.
All five methods degrade to near-random performance on summarisation, with AUC-ROC scores between 0.469 and 0.574 (0.5 corresponds to chance).
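To make the two reported metrics concrete, the sketch below computes AUC-ROC and F1 from per-example detector scores using only the standard library. It assumes binary labels (1 = hallucinated, 0 = faithful) and scores where higher means "more likely hallucinated"; the function names and the 0.5 decision threshold are illustrative choices, not details taken from the paper.

```python
from typing import Sequence


def auc_roc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC-ROC as the probability that a random positive outscores a
    random negative, counting ties as half (the Mann-Whitney formulation)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def f1_at_threshold(labels: Sequence[int], scores: Sequence[float],
                    threshold: float = 0.5) -> float:
    """F1 of the positive (hallucinated) class after thresholding scores."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0  # precision or recall is zero (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    labels = [1, 1, 0, 0]
    scores = [0.9, 0.4, 0.6, 0.1]
    print(auc_roc(labels, scores))          # 3 of 4 positive/negative pairs ranked correctly
    print(f1_at_threshold(labels, scores))  # one TP, one FP, one FN at threshold 0.5
```

A detector that assigns every example the same score gets an AUC-ROC of exactly 0.5 under this definition, which is why values such as 0.469 to 0.574 are read as near-random. The pairwise loop is quadratic in the number of examples; a rank-based implementation would be preferable on large evaluation sets.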

This systematic failure on summarisation marks the practical frontier of GPU-free hallucination detection, and the task-level results offer guidance for method selection under computational constraints.