EduArt benchmark shows single-format tests overestimate multimodal LLMs' art-history knowledge

The paper introduces EduArt, an educational-level benchmark for evaluating art-historical knowledge and visual reasoning in multimodal large language models. It comprises 871 human-authored questions from Italian secondary-school exercises and US Advanced Placement Art History exams.

Twelve models from six provider families were evaluated under answer-only and motivation conditions.
Multiple-choice accuracy saturated near ceiling for six models, so that format could not separate frontier models from one another.
Format was a strong predictor of accuracy; Claude Opus 4.6 dropped from over 94% on multiple choice to 23.9% on open completion.
The benchmark showed strong psychometric properties, with a mean item discrimination of 0.514.
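
For readers unfamiliar with the discrimination statistic, the sketch below shows one common way to compute it: the corrected item-total correlation, i.e. the correlation between correctness on an item and the score on all remaining items. This summary does not state which formula the authors used, so the choice of statistic, the function names `pearson` and `item_discrimination`, and the tiny response matrix are all illustrative assumptions, not the paper's method or data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input has zero variance."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # an item everyone answers the same way cannot discriminate
    return cov / sqrt(var_x * var_y)


def item_discrimination(responses):
    """Corrected item-total correlation for each item.

    responses[r][i] is 1 if respondent r answered item i correctly, else 0.
    The item's own score is removed from the total before correlating.
    """
    n_items = len(responses[0])
    scores = []
    for i in range(n_items):
        item = [row[i] for row in responses]
        rest = [sum(row) - row[i] for row in responses]
        scores.append(pearson(item, rest))
    return scores


# Toy matrix (hypothetical): 4 respondents (rows) x 3 items (columns).
toy_responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [1, 0, 0],
]

scores = item_discrimination(toy_responses)
mean_discrimination = sum(scores) / len(scores)
print(scores, mean_discrimination)
```

In this toy matrix the first item is answered correctly by every respondent, so its discrimination is zero. The same effect explains why a format that saturates near ceiling carries little information about differences between models.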

The authors argue that single-format benchmarks overestimate model reliability and that mapping capability profiles is essential for responsible use in art-historical scholarship.