A study finds that large language models do not reliably measure item discrimination in reading comprehension assessments. Some models show weak alignment with human-calibrated scores, with values ranging from 0.152 to 0.241, but current LLMs do not adequately capture how well assessment items distinguish between students of different proficiency levels.
LLMs Struggle to Capture Item Discrimination in Reading Assessments