The study introduces MedQADE, a standardized open-response clinical benchmark for German comprising 3,800 items annotated by ten physicians and nine LLM evaluators. It investigates whether automated LLM-as-a-Judge approaches replicate the calibration and caution of human clinicians.

  • The top-performing model, Gemini 3 Flash, achieved alignment with physician ratings (Cohen's kappa = 0.694 vs. 0.709), though wide confidence intervals limit interpretation.
  • Automated evaluators exhibited near-absent clinical metacognition by assigning definitive scores to every case, whereas physicians scaled abstention based on item difficulty.
  • The study quantified systematic lineage-dependent biases: models gave preferential scores to responses from their architectural siblings, an effect that was independent of language.
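
For readers unfamiliar with the agreement statistic cited above, the following is a minimal sketch of unweighted Cohen's kappa for two raters who scored the same items. The function name `cohens_kappa` and the toy ratings are illustrative assumptions, not taken from the study, and the summary does not say whether the authors used a weighted or unweighted variant.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length sequences of labels.

    Hypothetical helper for illustration only; not the study's code.
    """
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must score the same non-empty item set")

    n = len(rater_a)

    # Observed agreement: share of items on which both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: for each label, the product of the two raters'
    # marginal frequencies, summed over labels.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    # If both raters always use the same single label, kappa is undefined;
    # returning 1.0 here is a convention chosen for this sketch.
    if expected == 1.0:
        return 1.0

    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Toy data (assumed): 1 = acceptable answer, 0 = unacceptable answer.
    physician = [1, 1, 0, 0]
    llm_judge = [1, 0, 0, 0]
    print(cohens_kappa(physician, llm_judge))
```

Kappa corrects raw agreement for the agreement expected by chance, so two raters can agree on most items and still obtain a modest kappa when one label dominates. It says nothing about abstention behavior, which is why a high kappa and absent metacognition can coexist.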

The results demonstrate that statistical alignment does not ensure clinical caution and that evaluator independence requires explicit verification.