Researchers introduce DigitalCoach, a multimodal dataset comprising 72 human expert-novice computer use coaching sessions with 22,752 dialogue turns grounded in 28.1 hours of screen and input event recordings across five software applications.
- Automated evaluation shows that models give more direct instructions than human coaches but fewer explanations, error diagnoses, and knowledge-check questions.
- When coaching methods are fixed, model utterances resemble human references but remain poorly grounded in visual context.
- Interactive evaluations confirm that model coaches cause learners to passively follow instructions without deeper engagement.
The dataset lays a foundation for developing collaborative and proactive computer use coaching agents.