Zero-shot Procedural Mistake Detection with VLMs
ZeProM is a unified zero-shot framework that uses a pre-trained Video-Language Model to jointly perform procedural mistake detection and temporal action segmentation. On EgoPER tasks, it improves EDA by up to 4.4 points and F1@.5 by up to 2.0 points, matching or exceeding supervised methods without task-specific training.
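For readers unfamiliar with the segmentation metric, the sketch below shows how a segmental F1 score at an IoU threshold of 0.5 is commonly computed in the temporal action segmentation literature. It is an illustration under assumptions, not the evaluation code behind the numbers above: the function names `to_segments` and `segmental_f1` are hypothetical, the toy label sequences are invented, and details such as background-class handling may differ in the actual EgoPER protocol.

```python
from itertools import groupby


def to_segments(frame_labels):
    """Collapse per-frame labels into (label, start, end) runs; end is exclusive."""
    segments = []
    start = 0
    for label, run in groupby(frame_labels):
        length = len(list(run))
        segments.append((label, start, start + length))
        start += length
    return segments


def segmental_f1(pred_frames, gt_frames, iou_threshold=0.5):
    """Segmental F1: a predicted segment is a true positive if its best
    same-label ground-truth segment has IoU >= threshold and is still unmatched."""
    pred = to_segments(pred_frames)
    gt = to_segments(gt_frames)
    matched = [False] * len(gt)
    tp = fp = 0

    for p_label, p_start, p_end in pred:
        best_iou, best_idx = 0.0, -1
        for i, (g_label, g_start, g_end) in enumerate(gt):
            if g_label != p_label:
                continue
            inter = min(p_end, g_end) - max(p_start, g_start)
            if inter <= 0:
                continue
            # With a positive intersection, the covering span equals the union.
            union = max(p_end, g_end) - min(p_start, g_start)
            iou = inter / union
            if iou > best_iou:
                best_iou, best_idx = iou, i
        if best_idx >= 0 and best_iou >= iou_threshold and not matched[best_idx]:
            tp += 1
            matched[best_idx] = True
        else:
            fp += 1

    fn = matched.count(False)  # ground-truth segments no prediction claimed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy example (invented): boundaries are off by one frame, but both
    # segments still overlap their ground truth with IoU >= 0.5.
    gt = ["a"] * 4 + ["b"] * 4
    pred = ["a"] * 5 + ["b"] * 3
    print(segmental_f1(pred, gt))
```

Because the score is computed over segments rather than frames, small boundary shifts are tolerated while over-segmentation is penalized: each spurious fragment counts as a false positive.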