This study addresses an underexplored task, factual error detection in human-written text, by distilling a taxonomy of errors from corrections published for newspaper articles. The taxonomy reveals categories, such as kanji misconversions, that are absent from current hallucination benchmarks. The authors then evaluate vanilla large language models on both synthesized test cases and real corrections.
- A taxonomy of human-induced factual errors was derived from analyzing corrections of newspaper articles.
- Characteristic error categories such as kanji misconversions and numeral classifier errors were identified as distinct from LLM hallucinations.
- Even a high-performance model such as GPT-5.4 achieved a word-level F1 score of only 52% on the synthetic evaluation data.
- The experimental results show that detecting factual errors in human-written text remains difficult, a challenge that existing benchmarks do not capture.
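
To make the word-level F1 figure concrete, the following is a minimal sketch of how such a score could be computed. It assumes that gold and predicted errors are each represented as a set of word positions in a sentence; the summary does not state the authors' exact matching rule, so the function name `word_level_f1` and this position-based representation are illustrative assumptions, not the paper's evaluation code.

```python
from __future__ import annotations


def word_level_f1(gold: set[int], predicted: set[int]) -> tuple[float, float, float]:
    """Return (precision, recall, F1) over word positions flagged as erroneous.

    Assumption: an error is identified by the index of the word it affects.
    The paper's actual matching criterion may differ.
    """
    true_positives = len(gold & predicted)
    # Precision: share of flagged words that are truly erroneous.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of truly erroneous words that were flagged.
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: words 3 and 7 are erroneous; the model flags 3 and 5.
precision, recall, f1 = word_level_f1({3, 7}, {3, 5})
print(precision, recall, f1)  # 0.5 0.5 0.5
```

Under this definition, a score of 52% means that the harmonic mean of precision and recall over flagged words is roughly one half, so a substantial share of erroneous words are missed, a substantial share of flagged words are false alarms, or both.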