The authors have open-sourced a harness for evaluating Vision-Language Models (VLMs) on a user's own video data, with traced runs for reproducibility. Each result is tied to its specific input and configuration, so accuracy, latency, and cost can be measured and compared from run to run.

  • The framework supports building small evaluation sets from production-like footage rather than relying solely on leaderboards.
  • Traced runs ensure that every result is linked to its corresponding input data and configuration parameters.
  • An open repository is provided to allow users to reproduce evaluations on their own datasets.
  • The approach emphasizes tuning frame sampling and scene boundaries, which can affect accuracy more than the choice of model.
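
To make the idea of a traced run concrete, here is a minimal sketch of how a result could be linked to its input and configuration. It does not reflect the harness's actual schema: `TraceRecord`, `make_trace`, and the config keys (`model`, `frames_per_clip`) are hypothetical names chosen for illustration. The only assumption is that content hashes are enough to identify the clip and the settings used.

```python
import hashlib
import json
from dataclasses import dataclass


def fingerprint(data: bytes) -> str:
    """Return a SHA-256 hex digest identifying the given bytes."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class TraceRecord:
    """One evaluated clip: what went in, how it was run, what came out."""
    input_sha256: str    # hash of the raw video bytes
    config_sha256: str   # hash of the canonicalized configuration
    config: dict         # the configuration itself, kept for inspection
    output: str          # the model's answer for this clip
    latency_s: float     # wall-clock time of the call, in seconds


def make_trace(video_bytes: bytes, config: dict, output: str,
               latency_s: float) -> TraceRecord:
    # Serialize with sorted keys so that key order does not change the hash.
    config_json = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return TraceRecord(
        input_sha256=fingerprint(video_bytes),
        config_sha256=fingerprint(config_json.encode("utf-8")),
        config=dict(config),
        output=output,
        latency_s=latency_s,
    )
```

Under this sketch, two runs with the same clip and the same settings share both hashes, while changing a sampling parameter such as the hypothetical `frames_per_clip` produces a different `config_sha256`, so results from different settings are never mixed up.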

The tool addresses the practical side of VLM evaluation by treating latency and cost as constraints to be measured alongside accuracy.
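
As a rough illustration of reporting the three quantities together, the sketch below summarizes a small evaluation set. The field names (`correct`, `latency_s`, `cost_usd`) and the `summarize` helper are assumptions made for this example, not part of the published harness, and the choice of median latency and total cost is only one reasonable way to aggregate.

```python
from statistics import mean, median


def summarize(runs: list[dict]) -> dict:
    """Aggregate per-clip results into accuracy, latency, and cost figures.

    Each run is assumed to be a dict with:
      correct   (bool)  - whether the model's answer matched the label
      latency_s (float) - wall-clock seconds for the call
      cost_usd  (float) - estimated cost of the call
    """
    if not runs:
        raise ValueError("cannot summarize an empty evaluation set")
    return {
        "accuracy": mean(1.0 if r["correct"] else 0.0 for r in runs),
        "median_latency_s": median(r["latency_s"] for r in runs),
        "total_cost_usd": sum(r["cost_usd"] for r in runs),
    }
```

Computing one such summary per model and per sampling configuration puts the trade-offs side by side, rather than ranking on accuracy alone.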