The article identifies "intervention bias" as a critical failure mode of zero-shot large language models used as educational advisory agents: the agent recommends action even when the oracle policy mandates inaction. Using the Open University Learning Analytics Dataset, the study demonstrates that zero-shot GPT-4o exhibits a 43 percentage-point false-positive rate at day 56, which corresponds to approximately 4,300 unnecessary advisor contacts per cycle in a cohort of 10,000 students.
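The step from the false-positive rate to the contact count is simple arithmetic, and a minimal sketch may make it concrete. The function names and the action labels (`"no_action"`, `"tutor"`, `"email"`) are hypothetical and do not come from the article. The sketch also assumes that the rate applies across the whole cohort, which is what the 4,300-in-10,000 figure implies.

```python
def false_positive_rate(oracle, predicted, no_action="no_action"):
    """Share of oracle-inaction cases in which the agent still recommended an action."""
    # Keep only the agent's decisions for cases where the oracle says "do nothing".
    negatives = [p for o, p in zip(oracle, predicted) if o == no_action]
    if not negatives:
        raise ValueError("no oracle-inaction cases to evaluate")
    return sum(p != no_action for p in negatives) / len(negatives)


def unnecessary_contacts(fp_rate, n_students):
    """Expected number of needless advisor contacts (assumes the rate holds cohort-wide)."""
    return round(fp_rate * n_students)


# Toy data with hypothetical labels: two of four inaction cases are over-prescribed.
oracle = ["no_action", "no_action", "no_action", "no_action", "tutor"]
predicted = ["no_action", "tutor", "email", "no_action", "tutor"]
toy_rate = false_positive_rate(oracle, predicted)

# The summary's figure: a 0.43 rate over 10,000 students.
print(unnecessary_contacts(0.43, 10_000))  # 4300
```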

  • Supervised policy learning using a trajectory-conditioned ONNX Decision Transformer (DT) and an XGBoost classifier eliminates this bias, achieving near-zero calibration error.
  • The DT model reaches a macro-F1 of 0.79 and macro-recall of 0.85 across five action classes, including rare load-reduction actions, with a 0% action flip rate.
  • Both supervised models achieve sub-5 ms CPU decision latency, with the DT showing an indicative edge over XGBoost at the final cutoff.
  • The study reveals an evaluation gap where LLM-as-judge scoring (DeepEval G-Eval) rewards fluent over-prescription rather than actual decision quality.
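For readers unfamiliar with the metrics in the list above, the following sketch shows how macro-recall and macro-F1 are conventionally computed: each is the unweighted mean of the per-class scores, so rare classes count as much as common ones. It also sketches one plausible reading of "action flip rate" as the share of inputs whose recommended action changes across repeated runs. That reading is an assumption, since the summary does not define the term. All function names and labels are illustrative, not taken from the article.

```python
def macro_recall_f1(y_true, y_pred, labels):
    """Return (macro-recall, macro-F1): unweighted means of the per-class scores.

    Every value in y_true and y_pred must appear in labels.
    """
    counts = {c: {"tp": 0, "fp": 0, "fn": 0} for c in labels}
    for t, p in zip(y_true, y_pred):
        if t == p:
            counts[t]["tp"] += 1
        else:
            counts[p]["fp"] += 1  # predicted class gains a false positive
            counts[t]["fn"] += 1  # true class gains a false negative

    recalls, f1s = [], []
    for c in labels:
        tp, fp, fn = counts[c]["tp"], counts[c]["fp"], counts[c]["fn"]
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(recalls) / len(labels), sum(f1s) / len(labels)


def flip_rate(runs_per_input):
    """Share of inputs whose action differs across repeated runs (assumed definition)."""
    if not runs_per_input:
        raise ValueError("no inputs to evaluate")
    flipped = sum(len(set(runs)) > 1 for runs in runs_per_input)
    return flipped / len(runs_per_input)


# Toy two-class example with hypothetical labels.
recall, f1 = macro_recall_f1(
    ["a", "a", "b", "b"], ["a", "b", "b", "b"], labels=["a", "b"]
)
# A deterministic policy returns the same action on every run, so nothing flips.
stable = flip_rate([["a", "a", "a"], ["b", "b", "b"]])
```

Under this reading, a deterministic supervised policy yields a flip rate of 0 by construction, whereas a sampled LLM response can differ between runs on the same input.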

The authors argue that supervised learning is essential for high-stakes applications to ensure deterministic decisions and avoid the miscalibration inherent in zero-shot LLM approaches.