LLM Medical Scribing Benchmark: Omissions Outnumber Hallucinations
A benchmark of eight LLMs on 300 synthetic doctor-patient dialogues found 12 high-impact hallucinations and 520 clinically relevant omissions, roughly 43 omissions for every hallucination. DeepSeek excelled in prose quality and cost but missed many safety facts, while Claude Opus had the fewest omissions but poorer prose quality.
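
To put the two headline counts on a common scale, the arithmetic can be sketched as below. The four constants come from the summary. The per-note rates rest on an assumption the summary does not state: that the counts are totals across all models and that each model wrote exactly one note per dialogue. The function and variable names are illustrative only.

```python
# Counts reported in the summary.
HALLUCINATIONS = 12   # high-impact hallucinations
OMISSIONS = 520       # clinically relevant omissions
MODELS = 8
DIALOGUES = 300


def omission_to_hallucination_ratio(omissions: int, hallucinations: int) -> float:
    """Return how many omissions occurred per hallucination."""
    if hallucinations <= 0:
        raise ValueError("hallucinations must be positive")
    return omissions / hallucinations


ratio = omission_to_hallucination_ratio(OMISSIONS, HALLUCINATIONS)

# ASSUMPTION (not stated in the summary): the counts are totals over all
# models, and every model produced one note for every dialogue.
notes = MODELS * DIALOGUES
omissions_per_100_notes = 100 * OMISSIONS / notes
hallucinations_per_100_notes = 100 * HALLUCINATIONS / notes

print(f"omissions per hallucination: {ratio:.1f}")
print(f"omissions per 100 notes: {omissions_per_100_notes:.1f}")
print(f"hallucinations per 100 notes: {hallucinations_per_100_notes:.1f}")
```

The ratio depends only on the two reported counts. The per-note rates change if the benchmark counted notes differently from the assumption above.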