Systematic Benchmark of Lightweight Hallucination Detection Across QA, Dialogue, and Summarisation
This paper benchmarks five lightweight hallucination detection methods that run on CPU, offering practical alternatives for researchers who lack access to GPU-intensive or proprietary solutions. The five methods are ROUGE-L, semantic similarity, BERTScore, a FEVER-trained DeBERTa NLI detector, and an ensemble of similarity and NLI, each evaluated on the question answering, dialogue, and summarisation tasks of the HaluEval benchmark.
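
To make the simplest of these methods concrete, the following is a minimal sketch of how a ROUGE-L score could be turned into a hallucination flag. It is an illustration under stated assumptions, not the paper's implementation: the whitespace tokenisation, the function names (`lcs_length`, `rouge_l_f1`, `flag_hallucination`), and the `threshold=0.5` cut-off are all hypothetical choices made here, and a real evaluation would tune the threshold on held-out data.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 between two strings (assumed: lowercase, whitespace tokens)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def flag_hallucination(reference, candidate, threshold=0.5):
    """Flag the candidate when its overlap with the reference is low.

    The threshold is a hypothetical placeholder, not a value from the paper.
    """
    return rouge_l_f1(reference, candidate) < threshold
```

Here `reference` stands for whatever grounding text is available (the source document, dialogue knowledge, or gold answer) and `candidate` for the model output being checked. The same thresholding pattern would apply to any of the other similarity scores, with only the scoring function swapped out.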