Multimodal
arXiv cs.LG · 8d ago

Vision-language models don't always need images for chest X-ray accuracy

A causal audit shows that many vision-language models reach high accuracy on chest radiographs without using the images. Text-only models match multimodal models in accuracy and outperform them in grounding, and accuracy and confidence flags appear only when the image is actually used. These findings suggest that accuracy alone is insufficient to validate clinical deployment; grounding must be assessed as well.
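To make the idea of checking image use concrete, here is a minimal sketch of an image-ablation comparison: score the same model once with the radiograph and once with it withheld, then see how much the accuracy and the answers move. This is an illustration only, not the paper's audit procedure; the function name `ablation_audit`, the labels, and the predictions are all hypothetical.

```python
def ablation_audit(labels, preds_with_image, preds_without_image):
    """Compare predictions made with and without the image (hypothetical helper).

    Returns accuracy in both conditions and the fraction of cases whose
    answer changes when the image is withheld.
    """
    n = len(labels)
    if n == 0 or not (n == len(preds_with_image) == len(preds_without_image)):
        raise ValueError("inputs must be non-empty and of equal length")

    acc_with = sum(p == y for p, y in zip(preds_with_image, labels)) / n
    acc_without = sum(p == y for p, y in zip(preds_without_image, labels)) / n
    changed = sum(a != b for a, b in zip(preds_with_image, preds_without_image)) / n

    return {
        "accuracy_with_image": acc_with,
        "accuracy_without_image": acc_without,
        "fraction_changed": changed,
    }


# Made-up example data, for illustration only.
labels = ["pneumonia", "normal", "effusion", "normal"]
with_image = ["pneumonia", "normal", "effusion", "pneumonia"]
without_image = ["pneumonia", "normal", "normal", "pneumonia"]

print(ablation_audit(labels, with_image, without_image))
```

If accuracy barely drops and few answers change when the image is removed, the model is probably leaning on the text rather than the radiograph, which is the kind of gap that accuracy alone would hide.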

arXiv cs.AI · 8d ago

Semantics-First Latent Modeling for 3D MRI Reconstruction

A new framework prioritizes anatomical semantics during 3D MRI latent compression, addressing the loss of long-range coherence and clinical detail. It introduces a Latent Harmonization Encoder and a Semantic Recovery Block to preserve meaningful structures, and an Anatomy-aware Frequency Loss to maintain high-frequency diagnostic features. Experiments on public MRI datasets show improved reconstruction and cross-contrast synthesis quality.

arXiv cs.CL · 8d ago

Soft Prompting for Language Adherence in Multimodal LLMs

A soft prompting approach is proposed to improve language adherence in multimodal LLMs without strict output constraints. The work introduces a new metric to quantify language violations and evaluates three strategies: zero-shot prompting, supervised fine-tuning, and Chain-of-Thought reasoning. Results show that the approach reduces language violations while preserving automatic speech recognition (ASR) performance across multiple languages, with trade-offs examined under different compute constraints.

arXiv cs.CL · 8d ago

Visuals Lie, Consistency Speaks: Disentangling Spatial Attention from Reliability in Vision-Language Models

A study challenges the assumption that visual attention signals reliability in vision-language models. It finds near-zero correlation between spatial attention and accuracy, showing instead that self-consistency across reasoning paths is a stronger predictor of truth. Reliability is better explained by generation dynamics and internal state distributions, not visual attention patterns.
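As a rough illustration of self-consistency as a reliability signal, the sketch below samples several answers to the same question and uses the share that agree with the majority as a confidence score. It is the generic majority-vote formulation, not necessarily the measure used in the study; the function name `self_consistency` and the sample answers are hypothetical.

```python
from collections import Counter


def self_consistency(answers):
    """Return (majority_answer, agreement) for a list of sampled answers.

    agreement is the fraction of samples that match the majority answer
    after simple normalization (strip whitespace, lowercase).
    """
    if not answers:
        raise ValueError("need at least one sampled answer")

    normalized = [a.strip().lower() for a in answers]
    majority, votes = Counter(normalized).most_common(1)[0]
    return majority, votes / len(normalized)


# Hypothetical answers from five independently sampled reasoning paths.
samples = ["Left", "left", "right", "left ", "left"]
answer, agreement = self_consistency(samples)
print(answer, agreement)  # left 0.8
```

A low agreement score flags an answer as unreliable without looking at any attention map, which is the contrast the summary draws.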

arXiv cs.CL · 8d ago

LLMs Outperform Humans in Next Speaker Prediction

Large language models outperformed humans and supervised models in next speaker prediction using the AMI corpus, despite lacking audio-visual data and domain training. Multimodal LLMs surpassed text-based LLMs in addressee and turn-change detection but still fell short of human performance, highlighting challenges in utilizing raw audio-visual signals. Ablation studies show conversational context is crucial, especially for next speaker prediction, with both humans and LLMs struggling during frequent turn changes.

arXiv cs.CL · 8d ago

The Slop Paradox: AI Rewriting Degrades Clinical Uncertainty and Cross-Modal Alignment

AI-rewritten radiology reports show significant information loss, with EHR summarization eroding 51.4% of clinical entities and 43.7% of hedging language. Although EHR summarization largely preserves image-text alignment, the standardized and teaching-case rewriting tasks reduce cross-modal alignment by 14.9-16.5%, six to seven times more than EHR summarization. The study finds no preferential degradation of rare pathologies and identifies the type of rewriting task, not the clinical content, as the key driver of degradation.