AIPatient Arena: EHR-grounded evaluation of LLMs in clinical workflows

AIPatient Arena evaluates large language models in end-to-end clinical consultations using patient-specific knowledge graphs grounded in electronic health records (EHRs). It scores LLMs on eight clinical competence dimensions. The models perform well on interview skills, ethics, and explanation clarity, but remain weak at handling ambiguity, information coverage, and diagnostic reasoning. Recurring process failures include repetitive questioning and omitted history.
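
To make two of the reported process failures concrete, here is a minimal Python sketch of how information coverage and repetitive questioning might be measured against a patient-specific knowledge graph. It is an illustration under stated assumptions, not the benchmark's actual schema or scoring: the `Fact` triple format, the relation names, and the `coverage` and `repeated_questions` helpers are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Fact:
    """One (subject, relation, value) triple in a hypothetical patient graph."""
    subject: str
    relation: str
    value: str


@dataclass
class PatientGraph:
    """Toy stand-in for an EHR-grounded, patient-specific knowledge graph."""
    facts: set = field(default_factory=set)

    def relations(self):
        # Distinct relation types recorded for this patient.
        return {f.relation for f in self.facts}


def coverage(graph, asked_relations):
    """Fraction of the graph's relation types the model asked about at least once."""
    wanted = graph.relations()
    if not wanted:
        return 0.0
    return len(wanted & set(asked_relations)) / len(wanted)


def repeated_questions(asked_relations):
    """Relation types asked about more than once, in order of first repeat."""
    seen, repeats = set(), []
    for relation in asked_relations:
        if relation in seen and relation not in repeats:
            repeats.append(relation)
        seen.add(relation)
    return repeats


# Hypothetical patient record and a hypothetical sequence of question topics.
graph = PatientGraph({
    Fact("patient", "symptom", "chest pain"),
    Fact("patient", "allergy", "penicillin"),
    Fact("patient", "medical_history", "hypertension"),
    Fact("patient", "medication", "lisinopril"),
})
asked = ["symptom", "symptom", "medication", "symptom"]

print(coverage(graph, asked))                    # 0.5
print(repeated_questions(asked))                 # ['symptom']
print(sorted(graph.relations() - set(asked)))    # ['allergy', 'medical_history']
```

In this toy consultation the model asks about symptoms three times, covers only half of the recorded relation types, and never asks about allergies or medical history, which mirrors the repetitive-questioning and omitted-history failures described above.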