Researchers introduce DigitalCoach, a multimodal dataset of 72 human expert-novice coaching sessions on computer use, with 22,752 dialogue turns grounded in 28.1 hours of screen and input-event recordings across five software applications.

  • Automated evaluation shows that models give more direct instructions than human coaches but fewer explanations, error diagnoses, and knowledge-check questions.
  • When the coaching method is held fixed, model utterances resemble human references but remain poorly grounded in visual context.
  • Interactive evaluations confirm that model coaches lead learners to follow instructions passively, without deeper engagement.

The dataset lays a foundation for developing collaborative and proactive computer-use coaching agents.