Multimodal
arXiv cs.LG · 8d ago

Vision-language models don't always need images for chest X-ray accuracy

A causal audit shows that many vision-language models achieve high chest radiograph accuracy without using images. Text-only models match multimodal models in accuracy and outperform them in grounding, and accuracy and confidence flags appear only when the image is actually used. These findings suggest that accuracy alone is insufficient to validate clinical deployment; grounding must also be assessed.

arXiv cs.AI · 8d ago

Semantics-First Latent Modeling for 3D MRI Reconstruction

A new framework prioritizes anatomical semantics during 3D MRI latent compression, addressing long-range coherence and clinical detail loss. It introduces a Latent Harmonization Encoder and Semantic Recovery Block to preserve meaningful structures, and an Anatomy-aware Frequency Loss to maintain high-frequency diagnostic features. Experiments on public MRI datasets show improved reconstruction and cross-contrast synthesis quality.

arXiv cs.CL · 8d ago

Soft Prompting for Language Adherence in Multimodal LLMs

A soft prompting approach is proposed to improve language adherence in multimodal LLMs without strict output constraints. The work introduces a new metric to quantify language violations and evaluates three strategies: zero-shot prompting, supervised fine-tuning, and Chain-of-Thought reasoning. Results show that the approach reduces language violations while preserving ASR performance across multiple languages, and the trade-offs are examined under different compute constraints.

arXiv cs.CL · 8d ago

Visuals Lie, Consistency Speaks: Disentangling Spatial Attention from Reliability in Vision-Language Models

A study challenges the assumption that visual attention signals reliability in vision-language models. It finds near-zero correlation between spatial attention and accuracy, showing instead that self-consistency across reasoning paths is a stronger predictor of truth. Reliability is better explained by generation dynamics and internal state distributions, not visual attention patterns.
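
As a rough illustration of what "self-consistency across reasoning paths" can mean in practice, the sketch below scores a question by how often independently sampled answers agree. It is a generic majority-vote formulation, not the study's actual procedure; the function name `self_consistency`, the lowercase/strip normalization, and the sample answers are all assumptions made for the example.

```python
from collections import Counter


def self_consistency(answers):
    """Return (majority_answer, agreement_fraction) for sampled answers.

    `answers` is a list of final answers, one per sampled reasoning path.
    Normalization here (strip + lowercase) is a simplifying assumption.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    normalized = [a.strip().lower() for a in answers]
    counts = Counter(normalized)
    top_answer, top_count = counts.most_common(1)[0]
    # Agreement fraction: share of paths that landed on the majority answer.
    return top_answer, top_count / len(normalized)


if __name__ == "__main__":
    # Hypothetical answers from five sampled reasoning paths.
    samples = ["Left", "left", "right", "left ", "left"]
    answer, score = self_consistency(samples)
    print(answer, score)
```

A higher agreement fraction would be read as a sign that the answer is more likely to be reliable; how the study itself measures consistency is not specified in this summary.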

arXiv cs.CL · 8d ago

LLMs Outperform Humans in Next Speaker Prediction

Large language models outperformed humans and supervised models in next speaker prediction using the AMI corpus, despite lacking audio-visual data and domain training. Multimodal LLMs surpassed text-based LLMs in addressee and turn-change detection but still fell short of human performance, highlighting challenges in utilizing raw audio-visual signals. Ablation studies show conversational context is crucial, especially for next speaker prediction, with both humans and LLMs struggling during frequent turn changes.