The paper introduces EduArt, an educational-level benchmark for evaluating art-historical knowledge and visual reasoning in multimodal large language models. It comprises 871 human-authored questions from Italian secondary-school exercises and US Advanced Placement Art History exams.
- Twelve models from six provider families were evaluated under answer-only and motivation conditions.
- Multiple-choice accuracy approached ceiling for six models, so that format alone could not distinguish frontier capabilities.
- Format was a strong predictor of accuracy; Claude Opus 4.6 dropped from over 94% on multiple choice to 23.9% on open completion.
- The benchmark showed strong psychometric properties, with a mean item discrimination of 0.514.
The authors argue that single-format benchmarks overestimate model reliability and that mapping capability profiles is essential for responsible use in art-historical scholarship.
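
For readers unfamiliar with the discrimination statistic cited above, the following is a minimal sketch of one common way to compute it: the classical upper-lower index, D = p_upper - p_lower, averaged over items. This summary does not say which discrimination statistic the authors used (a point-biserial correlation is an equally plausible choice), so the method, the function name `discrimination_index`, the 27% group fraction, and the toy score matrix are all assumptions made for illustration, not the paper's procedure or data.

```python
from statistics import mean


def discrimination_index(responses, item, fraction=0.27):
    """Upper-lower discrimination index D for a single item.

    responses: one list of 0/1 item scores per test taker (here a test
               taker could be a model under one condition). This layout
               is a hypothetical one chosen for the example.
    item:      column index of the item to analyse.
    fraction:  share of test takers placed in each extreme group
               (0.27 is a conventional choice, assumed here).
    """
    n_group = max(1, round(len(responses) * fraction))
    # Rank test takers from highest to lowest total score.
    ranked = sorted(responses, key=sum, reverse=True)
    upper = ranked[:n_group]
    lower = ranked[-n_group:]
    # D is the gap in the item's pass rate between the two groups.
    p_upper = mean(row[item] for row in upper)
    p_lower = mean(row[item] for row in lower)
    return p_upper - p_lower


# Toy data only: 6 test takers x 3 items, not taken from the benchmark.
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
]

per_item = [discrimination_index(responses, i) for i in range(3)]
print("per-item D:", per_item)
print("mean D:", mean(per_item))
```

An item answered correctly by strong test takers and incorrectly by weak ones yields D close to 1, while an item that both groups answer equally well yields D close to 0. A benchmark-level mean is the average of these per-item values.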