Clinical Evidence Strength Is Recoverable From LLM Representations, Not Stated Grades
A study of 22 open-weight large language models finds that the strength of clinical evidence can be recovered from model activations and text, yet the grades the models explicitly state are no better than chance. The researchers analyzed 45,134 clinical claims, harmonized to a four-level evidence grading scale, to test whether models register and express evidence strength as something distinct from factual truth.