This work addresses the tendency of multimodal large language models to produce overconfident outputs in Medical Visual Question Answering by proposing a training-based framework that finetunes these models for better calibration. The method employs a composite loss function combining Brier-style calibration, anchor regularization, contrastive image-text alignment, and KL divergence terms to align model confidence with actual correctness.
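To make the Brier-style term concrete, the following is a minimal sketch of a Brier score computed over stated confidences and binary correctness labels. It is an illustration under assumptions, not the paper's loss: the function name `brier_score` and the example values are hypothetical, and the summary does not say how this term is weighted against the other three.

```python
def brier_score(confidences, correct):
    """Mean squared gap between stated confidence and 0/1 correctness."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")
    total = sum((c - float(y)) ** 2 for c, y in zip(confidences, correct))
    return total / len(confidences)


# Hypothetical batch: the model is right on half of the questions.
labels = [1, 0, 1, 0]
overconfident = brier_score([0.95, 0.95, 0.95, 0.95], labels)
matched = brier_score([0.5, 0.5, 0.5, 0.5], labels)

# The overconfident model is penalized more than the one whose
# confidence matches its 50% accuracy.
print(overconfident > matched)
```

A lower score means confidence tracks correctness more closely, which is the behavior the calibration term is meant to encourage.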

  • The framework uses a $2 \times 2$ factorial perturbation design crossing image presence with text integrity to probe reliance on visual versus language inputs.
  • A top-$K$ KL divergence regularizer preserves the model's answering ability during finetuning.
  • Experiments across three benchmarks and two architectures (MedGemma-4B-IT and Qwen2-VL-7B-Instruct) reduce calibration error by over 60% and improve discrimination by over 26%.
  • The approach outperforms prompting, sampling, and other training-based methods while preserving predictive accuracy, with all code publicly available.
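
Two of the components listed above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the condition labels, the function names `perturbation_conditions` and `topk_kl`, and the choice to take the top $K$ tokens from the reference (pre-finetuning) model and renormalize both distributions over them are all assumptions, since the summary does not specify these details.

```python
import math
from itertools import product


def perturbation_conditions():
    """Enumerate the 2 x 2 grid: image presence crossed with text integrity."""
    return [
        {"image_present": image, "text_intact": text}
        for image, text in product([True, False], repeat=2)
    ]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def topk_kl(ref_logits, new_logits, k):
    """KL(ref || new) restricted to the reference model's top-k tokens.

    Assumption: both distributions are renormalized over the selected tokens.
    """
    if len(ref_logits) != len(new_logits):
        raise ValueError("logit vectors must have the same length")
    if not 1 <= k <= len(ref_logits):
        raise ValueError("k must be between 1 and the vocabulary size")
    p = softmax(ref_logits)
    q = softmax(new_logits)
    # Indices of the k most probable tokens under the reference model.
    top = sorted(range(len(p)), key=lambda i: p[i], reverse=True)[:k]
    p_mass = sum(p[i] for i in top)
    q_mass = sum(q[i] for i in top)
    return sum(
        (p[i] / p_mass) * math.log((p[i] / p_mass) / (q[i] / q_mass))
        for i in top
    )


# Hypothetical logits over a four-token vocabulary.
ref = [2.0, 1.0, 0.0, -1.0]
drifted = [0.0, 2.0, 0.0, -1.0]

print(len(perturbation_conditions()))    # four conditions in the grid
print(topk_kl(ref, drifted, k=2) > 0)    # drift on the top tokens is penalized
```

Under these assumptions, the penalty is zero when the finetuned model keeps the reference model's relative preferences among its most likely answer tokens, and grows as those preferences shift.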

This technique helps ensure that the confidence expressed by medical AI models accurately reflects their actual performance, which is critical for reliable clinical decision support.