Topic · Multimodal
arXiv cs.LG · 10d ago

Vision-language models don't always need images for chest X-ray accuracy

A causal audit shows that many vision-language models achieve high chest radiograph accuracy without using images. Text-only models match multimodal models in performance and outperform them in grounding, with accuracy and confidence flags only appearing when image use occurs. These findings suggest that accuracy alone is insufficient to validate clinical deployment, and grounding must be assessed.

arXiv cs.CL · 10d ago

Visuals Lie, Consistency Speaks: Disentangling Spatial Attention from Reliability in Vision-Language Models

A study challenges the assumption that visual attention signals reliability in vision-language models. It finds near-zero correlation between spatial attention and accuracy, showing instead that self-consistency across reasoning paths is a stronger predictor of truth. Reliability is better explained by generation dynamics and internal state distributions, not visual attention patterns.
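As a rough illustration of what a self-consistency signal can look like, the sketch below scores agreement among several sampled final answers to one question. The function name `self_consistency`, the string normalization, and the sample answers are assumptions made for illustration; the study's actual measure is not described in this summary.

```python
from collections import Counter


def self_consistency(answers):
    """Return (majority_answer, agreement) for a list of sampled final answers.

    agreement is the fraction of samples that match the most common answer.
    Normalization (strip + lowercase) is an illustrative assumption.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    normalized = [a.strip().lower() for a in answers]
    top_answer, top_count = Counter(normalized).most_common(1)[0]
    return top_answer, top_count / len(normalized)


# Hypothetical final answers from five sampled reasoning paths.
samples = ["Left", "left", "right", "left ", "left"]
answer, agreement = self_consistency(samples)
print(answer, agreement)
```

A low agreement value would mark the answer as less reliable without consulting any attention map.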

arXiv cs.CL · 11d ago

ContextRL: Context-Aware RL for LLMs

ContextRL introduces an indirect auxiliary objective to improve long-horizon reasoning and multimodal performance in LLMs. It rewards models for selecting the context that supports a query-answer pair, using contrastive context data from coding agent trajectories and image-based visual questions. ContextRL achieves +2.2% and +1.8% gains over standard methods on long-horizon and visual QA benchmarks, with gains attributed to the selection objective, not data augmentation.

arXiv cs.AI · 11d ago

BinTrack: Open-Source Spatial QA with Binary Trajectory Search

BinTrack is a fully open-source spatial question answering agent that uses binary search over a robot's trajectory to locate answers. It achieves up to 22.8% higher accuracy than other open-source methods and matches closed-source model performance on the most challenging global category of the SpaceLocQA benchmark. The system also offers over 1.5x faster inference and introduces GangnamLoop, a real-world outdoor benchmark collected with a quadruped robot.
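The general idea of binary search over a trajectory can be sketched as follows. This is not BinTrack's implementation: it assumes a hypothetical per-frame check (`predicate`) that is monotone along the trajectory, meaning false for every frame before some point and true from that point on, which is the condition under which binary search is valid.

```python
def first_frame_where(num_frames, predicate):
    """Return the smallest frame index in [0, num_frames) where predicate is True.

    Assumes predicate is monotone over the trajectory (False ... False True ... True).
    Returns None if predicate is False for every frame.
    """
    lo, hi = 0, num_frames
    while lo < hi:
        mid = (lo + hi) // 2
        if predicate(mid):
            hi = mid        # answer is at mid or earlier
        else:
            lo = mid + 1    # answer is strictly after mid
    return lo if lo < num_frames else None


# Hypothetical check: "has the robot passed the landmark by frame i?"
calls = []

def has_passed_landmark(i):
    calls.append(i)
    return i >= 37


idx = first_frame_where(100, has_passed_landmark)
print(idx, len(calls))
```

The search inspects a logarithmic number of frames rather than every frame, which is one plausible source of an inference speedup when each check is expensive.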

arXiv cs.CL · 10d ago

Soft Prompting for Language Adherence in Multimodal LLMs

A soft prompting approach is proposed to improve language adherence in multimodal LLMs without strict output constraints. The method introduces a new metric to quantify language violations and evaluates three strategies: zero-shot prompting, supervised fine-tuning, and Chain-of-Thought reasoning. Results show effectiveness in reducing language violations while preserving ASR performance across multiple languages, with trade-offs considered under different compute constraints.

arXiv cs.CL · 10d ago

LLMs Outperform Humans in Next Speaker Prediction

Large language models outperformed humans and supervised models in next speaker prediction using the AMI corpus, despite lacking audio-visual data and domain training. Multimodal LLMs surpassed text-based LLMs in addressee and turn-change detection but still fell short of human performance, highlighting challenges in utilizing raw audio-visual signals. Ablation studies show conversational context is crucial, especially for next speaker prediction, with both humans and LLMs struggling during frequent turn changes.

arXiv cs.CL · 10d ago

The Slop Paradox: AI Rewriting Degrades Clinical Uncertainty and Cross-Modal Alignment

AI-rewritten radiology reports show significant information loss, with EHR summarization eroding 51.4% of clinical entities and 43.7% of hedging language. EHR summarization nonetheless largely preserves image-text alignment, whereas the standardized and teaching case tasks reduce cross-modal alignment by 14.9-16.5%, six to seven times more than EHR summarization. The study finds no preferential degradation of rare pathologies and identifies rewriting task type, not clinical content, as the key driver of degradation.

arXiv cs.CL · 10d ago

ChLogic: Testing Logical Reasoning Robustness in Chinese Expressions

ChLogic evaluates how well large language models maintain logical reasoning when English logical structures are expressed in Chinese. It reveals a persistent English-Chinese performance gap, with back-translation improving results on general items but harming performance on difficult problems. The benchmark highlights the impact of surface realization, translation artifacts, and model-specific behaviors on multilingual reasoning.

arXiv cs.AI · 11d ago

CrossMaps: Confidence-Aware Semantic Mapping for Rover Navigation

CrossMaps is a real-time, confidence-aware semantic mapping pipeline that uses RGB-D data to create language-queryable maps. It integrates multi-scale CLIP embeddings with a dual-memory architecture—Short-Term and Long-Term Memory—to aggregate visual observations and promote coherent, confident cells as persistent semantic landmarks. The system enables natural language queries to guide rover navigation via semantic heatmaps.