This study evaluates whether four frontier Large Language Models (LLMs), namely GPT, Claude Opus, Gemini, and GLM, can approximate expert judgment when grading short Linux/bash command responses. It shows that structured prompts significantly improve agreement with human graders, establishing a framework for AI-assisted assessment in computing education.
- The study used a four-level cognitive taxonomy ranging from information retrieval (L1) to advanced system management (L4).
- Models were tested on 1200 real responses from second-year Computer Engineering students graded by three expert instructors.
- Gemini 3.0 Pro with rubric-guided prompting achieved the highest human-AI agreement (ICC(3,1) = 0.888, MAE = 0.10).
- Agreement declined consistently as taxonomy level increased, with the largest discrepancies occurring at higher levels.
- Across all models, rubric quality had a larger effect on performance than provider choice.
These results show that question complexity reliably predicts how difficult accurate grading is for LLMs, and they provide a transferable evaluation protocol for determining which questions require human review.
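
The two agreement metrics reported above, ICC(3,1) and MAE, can be computed from paired human and model scores. The following is a minimal sketch that assumes the standard Shrout and Fleiss definition of ICC(3,1) (two-way mixed effects, consistency, single rater); the function names and the example scores are hypothetical and are not taken from the study, whose exact computation is not described here.

```python
from statistics import mean


def icc_3_1(ratings):
    """ICC(3,1): two-way mixed effects, consistency, single rater.

    `ratings` holds one row per graded response; each row holds one
    score per rater, for example (human_score, model_score).
    """
    n = len(ratings)        # number of responses
    k = len(ratings[0])     # number of raters
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need >= 2 responses and >= 2 raters in a rectangular layout")

    grand = mean(x for row in ratings for x in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]

    # Between-response mean square (MSR).
    ms_rows = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)

    # Residual mean square (MSE), after removing response and rater effects.
    ss_err = sum(
        (ratings[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n)
        for j in range(k)
    )
    ms_err = ss_err / ((n - 1) * (k - 1))

    denom = ms_rows + (k - 1) * ms_err
    if denom == 0:
        raise ValueError("ICC is undefined when all scores are identical")
    return (ms_rows - ms_err) / denom


def mean_absolute_error(human, model):
    """Average absolute difference between paired scores."""
    if not human or len(human) != len(model):
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(abs(h - m) for h, m in zip(human, model)) / len(human)


if __name__ == "__main__":
    # Hypothetical scores on a 0-1 scale, for illustration only.
    human = [0.0, 0.5, 1.0, 0.5]
    model = [0.0, 0.5, 1.0, 1.0]
    print(f"ICC(3,1) = {icc_3_1(list(zip(human, model))):.3f}")
    print(f"MAE      = {mean_absolute_error(human, model):.3f}")
```

Because ICC(3,1) measures consistency rather than absolute agreement, a model that scores every response a fixed amount higher than the human graders would still reach an ICC of 1. MAE captures that kind of systematic offset, which is one reason to read the two metrics together.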