A study finds that large language models (LLMs) do not reliably estimate item discrimination in reading comprehension assessments. Some models show weak alignment with human-calibrated scores, with reported values ranging from 0.152 to 0.241, but current LLMs do not adequately capture how well assessment items distinguish between students of different proficiency levels.
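
For readers unfamiliar with the term, item discrimination can be illustrated with a small sketch. The summary does not say which discrimination index the study calibrated against, so the example below assumes one common classical-test-theory choice, the corrected item-total correlation (the correlation between an item's 0/1 score and the total score on the remaining items). The function names and the toy response matrix are hypothetical and are not taken from the study.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def item_discrimination(responses, item):
    """Corrected item-total correlation for one item.

    responses: list of rows, one per student, each a list of 0/1 item scores.
    item: column index of the item being evaluated.
    The item's own score is excluded from the total so it does not
    inflate the correlation.
    """
    item_scores = [row[item] for row in responses]
    rest_scores = [sum(row) - row[item] for row in responses]
    return pearson(item_scores, rest_scores)


# Hypothetical toy data: 6 students (rows) x 3 items (columns), 1 = correct.
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
]

for i in range(3):
    print(f"item {i}: discrimination = {item_discrimination(responses, i):.3f}")
```

In this toy data, item 0 is answered correctly mostly by students who also do well on the other items, so its index is clearly positive, while item 2 is answered correctly equally often by stronger and weaker students, so its index is zero. Values of this kind, estimated from real student responses, are the sort of human-calibrated target an LLM's predictions would be compared against.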