The paper introduces EduArt, an educational-level benchmark for evaluating art-historical knowledge and visual reasoning in multimodal large language models. It comprises 871 human-authored questions from Italian secondary-school exercises and US Advanced Placement Art History exams.

  • Twelve models from six provider families were evaluated under answer-only and motivation conditions.
  • Multiple-choice accuracy saturated near ceiling for six models, so that format failed to distinguish among frontier models.
  • Format was a strong predictor of accuracy: for example, Claude Opus 4.6 dropped from over 94% on multiple choice to 23.9% on open completion.
  • The benchmark showed strong psychometric properties, with a mean item discrimination of 0.514.
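
For readers unfamiliar with the discrimination statistic in the last point, the sketch below shows one common way such a figure can be computed. The summary does not say which index the authors used, so this assumes a corrected point-biserial correlation between each question and the score on the remaining questions, with evaluated models playing the role of test takers. The function names and the toy response matrix are hypothetical and are not data from the paper.

```python
from statistics import mean, pstdev


def item_discrimination(responses, item):
    """Corrected point-biserial discrimination for one question.

    responses: one row per test taker (assumed here: one per model),
    each row a list of 0/1 correctness flags, one per question.
    """
    item_scores = [row[item] for row in responses]
    # Exclude the item from the total so it does not correlate with itself.
    rest_scores = [sum(row) - row[item] for row in responses]
    sd_item, sd_rest = pstdev(item_scores), pstdev(rest_scores)
    if sd_item == 0 or sd_rest == 0:
        return 0.0  # no variance, so the item cannot separate test takers
    m_item, m_rest = mean(item_scores), mean(rest_scores)
    cov = mean([(i - m_item) * (r - m_rest)
                for i, r in zip(item_scores, rest_scores)])
    return cov / (sd_item * sd_rest)


def mean_discrimination(responses):
    """Average the per-item discrimination over all questions."""
    n_items = len(responses[0])
    return mean([item_discrimination(responses, j) for j in range(n_items)])


# Hypothetical toy matrix: 4 test takers x 3 questions (not from the paper).
toy = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
print(round(mean_discrimination(toy), 3))
```

Values near 1 mean that stronger test takers tend to answer the question correctly while weaker ones do not. A question that every model answers correctly has no variance and contributes 0 in this sketch, which is one way to see why a format that saturates near ceiling carries little information about differences between models.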

The authors argue that single-format benchmarks overestimate model reliability and that mapping capability profiles is essential for responsible use in art-historical scholarship.