A study of 22 open-weight large language models finds that the strength of clinical evidence can be recovered from model activations and text, yet the grades the models explicitly state are no better than chance. Researchers analyzed 45,134 clinical claims, harmonized onto a four-level evidence-grade scale, to test whether models register and express evidence strength as something distinct from factual truth.

  • A linear estimator recovered the evidence grade in every tested model, with a median AUROC of 71.8 (on a 0-100 scale).
  • The recoverable signal was largely lexical and did not transfer across topics or frameworks, yet remained distinct from factual truth.
  • Stated grades fell to chance level, scoring 25-27 percentage points below the estimator.
  • Decodability of evidence strength did not increase with model scale and was weakest in reasoning models.
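
The comparison in the bullets above, a linear estimator scored by AUROC against a readout that sits at chance, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are synthetic Gaussian vectors rather than real model activations, the four-level grade is collapsed to a binary high/low label, the probe is a plain logistic regression, and the "stated" scores are random numbers standing in for a chance-level readout. The function names (auroc, fit_linear_probe, make_synthetic, demo) are hypothetical, and the sketch does not reproduce the study's estimator or its numbers.

```python
import math
import random


def auroc(scores, labels):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def probe_score(w, b, x):
    """Linear readout: bias plus dot product of weights and features."""
    return b + sum(wi * xi for wi, xi in zip(w, x))


def fit_linear_probe(X, y, lr=0.5, epochs=200):
    """Logistic regression fitted by batch gradient descent; returns (weights, bias)."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, target in zip(X, y):
            z = max(-30.0, min(30.0, probe_score(w, b, x)))  # clamp to avoid overflow
            err = 1.0 / (1.0 + math.exp(-z)) - target
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def make_synthetic(n, dim, shift, rng):
    """Assumed stand-in for activations: Gaussian noise plus a label shift on axis 0."""
    X, y = [], []
    for i in range(n):
        label = i % 2  # balanced binary label (high vs. low evidence, by assumption)
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] += shift * label
        X.append(x)
        y.append(label)
    return X, y


def demo(seed=0):
    """Fit on one synthetic split, score on a held-out one; returns both AUROCs."""
    rng = random.Random(seed)
    X_train, y_train = make_synthetic(400, 8, 1.0, rng)
    X_test, y_test = make_synthetic(200, 8, 1.0, rng)
    w, b = fit_linear_probe(X_train, y_train)
    probe_auc = auroc([probe_score(w, b, x) for x in X_test], y_test)
    # Stand-in for a chance-level readout: scores unrelated to the label.
    stated_auc = auroc([rng.random() for _ in y_test], y_test)
    return probe_auc, stated_auc


if __name__ == "__main__":
    probe_auc, stated_auc = demo()
    gap = 100.0 * (probe_auc - stated_auc)
    print(f"probe AUROC:  {100.0 * probe_auc:.1f}")
    print(f"stated AUROC: {100.0 * stated_auc:.1f}")
    print(f"gap: {gap:.1f} percentage points")
```

Running the script prints both AUROCs on a 0-100 scale and their difference in percentage points, which is the form in which the gap between estimator and stated grades is reported above. On real data, the feature vectors would come from a model's hidden states and the stated scores from the grades the model writes out.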

The findings indicate that clinical LLMs carry an ordered signal of evidence support within their representations but fail to express it, meaning their stated grades do not accurately convey a claim's support even when the information is present.