This study evaluates whether four frontier large language models (LLMs), namely GPT, Claude Opus, Gemini, and GLM, can approximate expert judgment when grading short Linux/bash command responses. It shows that structured, rubric-guided prompts significantly improve agreement with human graders, and it establishes a framework for AI-assisted assessment in computing education.

  • The study used a four-level cognitive taxonomy ranging from information retrieval (L1) to advanced system management (L4).
  • Models were tested on 1200 real responses from second-year Computer Engineering students graded by three expert instructors.
  • Gemini 3.0 Pro with rubric-guided prompting achieved the highest human-AI agreement (ICC(3,1) = 0.888, MAE = 0.10).
  • Agreement declined consistently as taxonomy level increased, with the largest discrepancies occurring at higher levels.
  • Across all models, rubric quality had a larger effect on performance than provider choice.

These results show that question complexity is a reliable predictor of the difficulty LLMs face in grading accurately and provide a transferable evaluation protocol for determining which questions require human review.
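
For readers unfamiliar with the two agreement metrics reported above, the following is a minimal sketch of how ICC(3,1) (two-way mixed effects, consistency, single rater) and the mean absolute error (MAE) can be computed for paired human and model scores. It is not the study's own code: the function names, the 0-to-1 score scale, and the sample scores are assumptions made purely for illustration.

```python
def icc_3_1(ratings):
    """ICC(3,1): two-way mixed effects, consistency, single rater.

    ratings: list of rows, one row per graded response,
             one column per rater (e.g. [human_score, model_score]).
    """
    n = len(ratings)       # number of responses
    k = len(ratings[0])    # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA decomposition of the total sum of squares.
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_err = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)


def mae(human, model):
    """Mean absolute error between paired score lists."""
    return sum(abs(h - m) for h, m in zip(human, model)) / len(human)


# Hypothetical scores on an assumed 0-to-1 scale (not data from the study).
human_scores = [0.0, 0.5, 1.0, 0.5, 1.0]
model_scores = [0.0, 0.5, 0.5, 0.5, 1.0]

pairs = [[h, m] for h, m in zip(human_scores, model_scores)]
print(f"ICC(3,1) = {icc_3_1(pairs):.3f}")
print(f"MAE      = {mae(human_scores, model_scores):.3f}")
```

Because ICC(3,1) measures consistency rather than absolute agreement, a model that is uniformly stricter or more lenient than the human grader by a constant offset can still score close to 1. Reporting MAE alongside it captures that offset.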