A user reports that Qwen3-VL-2B is the only vision-language model they found viable for reliably extracting data from images into JSON on low-spec devices such as Intel i3 laptops with 8GB RAM. The author notes that, despite this performance, the model is absent from major benchmarks such as Artificial Analysis and the Open LLM Leaderboard.

  • Testing was conducted on three "potato" laptops running Windows 11 with integrated GPUs.
  • Qwen3-VL-2B in Q4_K_M GGUF format outperformed both Qwen3-VL-4B and Qwen3.5 2B for this specific task.
  • Other tested models failed to produce acceptable results for JSON extraction on such hardware.
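
The post does not say how "acceptable" JSON output was judged. One way to make such a comparison repeatable is sketched below, as an assumption rather than the author's method: pull the first JSON object out of each raw model response (small models often wrap it in prose or code fences), check it for a set of required keys, and report the pass rate. The function names `extract_json` and `pass_rate` and the example keys `total` and `vendor` are hypothetical.

```python
import json
from typing import Iterable, Optional


def extract_json(raw: str, required_keys: Iterable[str]) -> Optional[dict]:
    """Return the first JSON object in `raw` that has all required keys, else None."""
    required = set(required_keys)
    decoder = json.JSONDecoder()
    # Try every "{" as a possible start, since the model may surround
    # the JSON with explanatory text or Markdown code fences.
    start = raw.find("{")
    while start != -1:
        try:
            obj, _ = decoder.raw_decode(raw, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict) and required.issubset(obj):
            return obj
        start = raw.find("{", start + 1)
    return None


def pass_rate(outputs: list, required_keys: Iterable[str]) -> float:
    """Fraction of raw outputs that yield a usable JSON object."""
    if not outputs:
        return 0.0
    keys = set(required_keys)
    ok = sum(extract_json(o, keys) is not None for o in outputs)
    return ok / len(outputs)


if __name__ == "__main__":
    # Hypothetical raw responses, not data from the post.
    samples = [
        'Here is the result:\n```json\n{"total": 12.5, "vendor": "ACME"}\n```',
        '{"total": 3}',
        "no json here",
    ]
    print(pass_rate(samples, {"total", "vendor"}))
```

Running the same image set through each candidate model and comparing pass rates would give a like-for-like figure. This checks structure only, not whether the extracted values are correct.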

The post asks why benchmarks ignore the model and whether any other models can handle JSON extraction on potatoes, phones, or Raspberry Pis.