Unified benchmark for span-level hallucination detection over code, tool output, and documents

The authors introduce a unified benchmark for span-level hallucination detection that extends beyond natural-language document evidence to structured inputs: source code, developer-tool output, Markdown documents, tables, and repository metadata. The benchmark is constructed by injecting localized hallucinations, labeled with exact character offsets, into correct, grounded answers; the code test split is additionally validated through evidence-based review.
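The injection step described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the `LabeledAnswer` type, the `inject_hallucination` helper, and the example sentence are all hypothetical, and the sketch assumes a hallucination is created by swapping one fragment of a correct answer for an unsupported one and recording the half-open character range of the replacement.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class LabeledAnswer:
    """Hypothetical container: answer text plus hallucinated character spans."""
    text: str
    spans: List[Tuple[int, int]]  # half-open [start, end) character offsets


def inject_hallucination(answer: str, original: str, replacement: str) -> LabeledAnswer:
    """Swap the first occurrence of `original` for `replacement` and label it.

    Assumption: one localized edit per answer; the source does not say how
    many edits are injected or how fragments are chosen.
    """
    start = answer.find(original)
    if start < 0:
        raise ValueError("original fragment not found in answer")
    end = start + len(replacement)
    new_text = answer[:start] + replacement + answer[start + len(original):]
    return LabeledAnswer(text=new_text, spans=[(start, end)])


# Made-up grounded answer; the evidence would say the function returns a dict.
grounded = "The function parse_config returns a dict of settings."
sample = inject_hallucination(grounded, "a dict", "a list")

start, end = sample.spans[0]
print(sample.text[start:end])  # the labeled (hallucinated) fragment
```

Because the label is derived from the edit itself, the span boundaries are exact by construction, which is what makes character-level scoring possible without manual annotation of every example.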

A fine-tuned Qwen3.5-2B detector achieves 0.689 span-F1 on the unified test set.
On the code-agent source, the model reaches 0.60 span-F1, substantially outperforming LettuceDetect-large (0.17) and zero-shot LLM judges (at most 0.22).
The same model remains competitive on established natural-language benchmarks, scoring 81.8 example-F1 on RAGTruth and 0.724 IoU on the English portion of PsiloQA.
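The summary does not define span-F1 or IoU precisely, so the following is a sketch of one common character-level formulation rather than the benchmark's official scorer. The function names, the half-open span convention, and the choice to score two empty span sets as a perfect match are assumptions; how per-example scores are aggregated over a dataset is also left open here.

```python
from typing import Dict, Iterable, Set, Tuple

Span = Tuple[int, int]  # half-open [start, end) character offsets


def spans_to_chars(spans: Iterable[Span]) -> Set[int]:
    """Expand spans into the set of character positions they cover."""
    chars: Set[int] = set()
    for start, end in spans:
        chars.update(range(start, end))
    return chars


def char_span_scores(predicted: Iterable[Span], gold: Iterable[Span]) -> Dict[str, float]:
    """Character-overlap precision, recall, F1, and IoU for one example."""
    pred, ref = spans_to_chars(predicted), spans_to_chars(gold)
    if not pred and not ref:
        # Assumed convention: nothing to find and nothing predicted is a match.
        return {"precision": 1.0, "recall": 1.0, "f1": 1.0, "iou": 1.0}
    overlap = len(pred & ref)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    iou = overlap / len(pred | ref)
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou}


# Made-up example: gold span covers 6 characters, prediction covers 8,
# and the two share 4 characters.
scores = char_span_scores(predicted=[(36, 44)], gold=[(34, 40)])
print(scores)
```

In this made-up example the prediction and the gold span share 4 characters out of 8 predicted and 6 labeled, so precision is 4/8, recall is 4/6, F1 is 8/14, and IoU is 4/10. F1 is never lower than IoU for the same overlap, which is one reason the two numbers reported above are not directly comparable.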

This work addresses the growing need for hallucination detection in grounded generation systems, which increasingly rely on structured inputs rather than on natural language alone.