A new open-source tool called the Visual Calendar Comprehension Benchmark (VCCB) measures how well multimodal models can extract structured data from week-view calendar screenshots. The benchmark tests extraction accuracy across nine images rendered in Outlook, HCL Notes, and Thunderbird, including clean screenshots and perspective-distorted photos.

  • Humans achieve ~99% accuracy, while frontier hosted models like Opus score 80-85%.
  • Mid-tier models such as the free tier of ChatGPT reach approximately 75%, whereas local models and Claude Haiku score between 38% and 58%.
  • The benchmark is designed to help users quantify the accuracy loss caused by model quantization and prompt variations.
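
To make the idea of "accuracy loss" concrete, here is a minimal sketch of how an extraction score and the drop between two runs could be computed. It is not the benchmark's actual scoring code: the event schema (`title`, `day`, `start`, `end`), the exact-match rule, and the function names are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical event schema; the benchmark's real output format may differ.
Event = Dict[str, str]
FIELDS = ("title", "day", "start", "end")


def event_key(event: Event) -> Tuple[str, ...]:
    """Normalize an event to a comparable tuple (ignores case and outer whitespace)."""
    return tuple(event.get(field, "").strip().lower() for field in FIELDS)


def extraction_accuracy(expected: List[Event], predicted: List[Event]) -> float:
    """Fraction of ground-truth events that the model reproduced exactly."""
    if not expected:
        return 1.0 if not predicted else 0.0
    remaining = [event_key(e) for e in predicted]
    hits = 0
    for event in expected:
        key = event_key(event)
        if key in remaining:
            remaining.remove(key)  # each predicted event may match only once
            hits += 1
    return hits / len(expected)


def accuracy_loss(baseline: float, variant: float) -> float:
    """Accuracy lost, in percentage points, when moving from baseline to variant."""
    return (baseline - variant) * 100


# Made-up data: four ground-truth events from one imaginary screenshot.
truth = [
    {"title": "Standup", "day": "Mon", "start": "09:00", "end": "09:15"},
    {"title": "Design review", "day": "Tue", "start": "14:00", "end": "15:00"},
    {"title": "Lunch", "day": "Wed", "start": "12:00", "end": "13:00"},
    {"title": "Retro", "day": "Fri", "start": "16:00", "end": "17:00"},
]
full_precision = [dict(e) for e in truth]  # imagined run that gets everything right
quantized = truth[:3] + [
    {"title": "Retro", "day": "Fri", "start": "15:00", "end": "17:00"}  # wrong start time
]

base = extraction_accuracy(truth, full_precision)
quant = extraction_accuracy(truth, quantized)
print(f"baseline={base:.2f} quantized={quant:.2f} loss={accuracy_loss(base, quant):.1f} points")
```

This sketch counts only missed or altered events; it does not penalize extra events the model invents, which a real scorer would likely need to handle.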

The author invites the community to run the benchmark locally and submit results to a public leaderboard, to better understand how far local models lag behind hosted ones.