Aloe-Vision: Robust Vision-Language Models for Healthcare

This work introduces Aloe-Vision, a family of open-source large vision-language models (LVLMs) at 7B and 72B parameters, trained on the newly released Aloe-Vision-Data dataset to address data scarcity and robustness issues in healthcare AI. The authors show that their high-quality training mixture yields significant performance gains over baselines while preserving general capabilities.

Aloe-Vision-Data: A large-scale, quality-filtered fine-tuning mixture that combines medical and general-domain sources, both multimodal and text-only.
Open Release: Full model weights, training recipes, and data are openly released for both the 7B and 72B models.
CareQA-Vision: A new vision benchmark derived from the Spanish medical and nursing residency exams (MIR and EIR), with low risk of training-data contamination.
Performance: The models achieve competitive performance against state-of-the-art alternatives without compromising general capabilities.
Vulnerability Analysis: Current LVLMs remain vulnerable to adversarial and misleading inputs, highlighting reliability challenges in clinical contexts.