Open-sourcing a harness for evaluating VLMs on your own video with traced runs
The authors have open-sourced a harness for evaluating vision-language models (VLMs) on users' own video data. Every run is traced: each result is tied to the specific input and configuration that produced it, so measurements of accuracy, latency, and cost are reproducible and can be attributed to a known setup.
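To make the idea of a traced run concrete, here is a minimal sketch of how a result could be tied to its input and configuration. It is not the harness's actual API: `TraceRecord`, `trace_run`, `fingerprint`, and `fake_model` are hypothetical names, and the assumption that a model call returns an answer together with its cost in USD is made only for illustration.

```python
import hashlib
import json
import time
from dataclasses import dataclass, asdict
from typing import Callable


def fingerprint(data: bytes) -> str:
    """Return a short SHA-256 hex digest identifying the given bytes."""
    return hashlib.sha256(data).hexdigest()[:16]


@dataclass(frozen=True)
class TraceRecord:
    """Hypothetical record tying one result to its input and config."""
    input_id: str     # hash of the raw video bytes
    config_id: str    # hash of the canonicalized configuration
    prediction: str
    correct: bool
    latency_s: float
    cost_usd: float


def trace_run(video: bytes, config: dict, expected: str,
              model: Callable[[bytes, dict], tuple[str, float]]) -> TraceRecord:
    """Run `model` once and record accuracy, latency, and cost."""
    # Sorting keys makes the config hash independent of dict ordering.
    config_bytes = json.dumps(config, sort_keys=True).encode("utf-8")
    start = time.perf_counter()
    prediction, cost_usd = model(video, config)
    latency_s = time.perf_counter() - start
    return TraceRecord(
        input_id=fingerprint(video),
        config_id=fingerprint(config_bytes),
        prediction=prediction,
        correct=(prediction == expected),
        latency_s=latency_s,
        cost_usd=cost_usd,
    )


def fake_model(video: bytes, config: dict) -> tuple[str, float]:
    """Stand-in for a real VLM call (assumption); returns (answer, cost in USD)."""
    return "a dog catches a ball", 0.002


if __name__ == "__main__":
    record = trace_run(b"fake video bytes", {"model": "demo", "fps": 1},
                       "a dog catches a ball", fake_model)
    print(json.dumps(asdict(record), indent=2))
```

Because the identifiers are content hashes, two runs on the same video bytes with the same configuration produce the same `input_id` and `config_id`, which is what lets a later reader confirm that two results are actually comparable.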