This study introduces clinical reasoning graphs to evaluate the diagnostic reasoning patterns of large language models (LLMs). It finds that the models achieve diagnostic competence but lack consistent reasoning schemas. The authors extracted structured graph representations from 750 traces across five LLMs and tested whether clinically similar cases elicit stable reasoning patterns.
- Clinical reasoning graphs use a domain-grounded ontology with 5 node types and 7 edge types to represent LLM diagnostic traces.
- Analysis of 750 traces from five LLMs on NEJM Clinicopathological Conference cases found no significant difference in graph similarity between clinically similar and dissimilar cases.
- Graph similarity was nearly identical whether both models in a pair were correct (0.488) or both incorrect (0.484), so structural agreement did not track diagnostic accuracy.
- Structured reflection prompting increased explicit discriminating-feature analysis by 33% but did not improve cross-case consistency.
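To make the comparison in the findings above concrete, the following is a minimal sketch of how typed reasoning graphs could be compared across cases. It is illustrative only: the summary does not name the 5 node types or 7 edge types, nor the similarity measure the authors used, so the labels (`finding`, `supports`, and so on), the `TypedEdge` and `ReasoningGraph` classes, the Jaccard overlap, and the `same_cluster` predicate are all assumptions introduced here.

```python
from dataclasses import dataclass
from itertools import combinations
from statistics import mean


@dataclass(frozen=True)
class TypedEdge:
    """One reasoning step, reduced to its type signature."""
    src_type: str   # hypothetical node-type label
    relation: str   # hypothetical edge-type label
    dst_type: str   # hypothetical node-type label


@dataclass(frozen=True)
class ReasoningGraph:
    case_id: str
    model: str
    edges: frozenset  # frozenset of TypedEdge


def jaccard(a: ReasoningGraph, b: ReasoningGraph) -> float:
    """Overlap of typed-edge sets; a stand-in, not the paper's metric."""
    union = a.edges | b.edges
    if not union:
        return 1.0  # two empty graphs are treated as identical
    return len(a.edges & b.edges) / len(union)


def mean_similarity_by_group(graphs, same_cluster):
    """Average pairwise similarity for similar vs dissimilar case pairs."""
    groups = {"similar": [], "dissimilar": []}
    for a, b in combinations(graphs, 2):
        if a.case_id == b.case_id:
            continue  # only cross-case pairs are of interest here
        key = "similar" if same_cluster(a.case_id, b.case_id) else "dissimilar"
        groups[key].append(jaccard(a, b))
    return {k: (mean(v) if v else None) for k, v in groups.items()}


# Toy data with made-up type labels and case clusters.
e1 = TypedEdge("finding", "supports", "hypothesis")
e2 = TypedEdge("finding", "refutes", "hypothesis")
e3 = TypedEdge("hypothesis", "prompts", "test")
e4 = TypedEdge("test", "confirms", "diagnosis")

g1 = ReasoningGraph("case-A", "model-1", frozenset({e1, e2, e3}))
g2 = ReasoningGraph("case-B", "model-1", frozenset({e2, e3, e4}))
g3 = ReasoningGraph("case-C", "model-1", frozenset({e4}))

clusters = {"case-A": "x", "case-B": "x", "case-C": "y"}
result = mean_similarity_by_group(
    [g1, g2, g3], lambda p, q: clusters[p] == clusters[q]
)
print(result)
```

Reducing each edge to its type signature is one possible design choice: it lets graphs from different cases be compared on reasoning structure rather than on case-specific content. Under a consistent reasoning schema, the `similar` mean would be expected to exceed the `dissimilar` mean; the study reports no significant difference between the two.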
The findings indicate that final-answer accuracy should be complemented by process-level evaluation to distinguish stable reasoning from pattern matching. The authors release their ontology, pipeline, and artifacts as resources for structured evaluation.