Multimodal — korshunov.ai

A causal audit shows that many vision-language models achieve high chest radiograph accuracy without using the images. Text-only models match multimodal models in performance and outperform them in grounding, and accuracy and confidence flags appear only when the image is actually used. These findings suggest that accuracy alone is insufficient to validate clinical deployment; grounding must also be assessed.

arXiv cs.AI · 10d ago

WEQA: Wearable Health Question Answering with Query-Adaptive Agentic Reasoning

WEQA introduces a query-adaptive agent framework that combines language models with specialized wearable data analysis tools. It outperforms LLM and agentic baselines by 24% in accuracy and demonstrates improved usefulness and clinical soundness in expert and user evaluations.

arXiv cs.AI · 10d ago

LEADS: Agentic Discovery of Hybrid Models for Cardiac Electrophysiology

LEADS proposes a framework that uses an LLM agent to discover hybrid cardiac electrophysiology models through an iterative reasoning-and-action loop. It formulates domain knowledge as a structured action space, enabling physically grounded, interpretable, and numerically stable model designs, outperforming both human-designed and other LLM-based approaches on synthetic and real cardiac data.

arXiv cs.CL · 10d ago

Visuals Lie, Consistency Speaks: Disentangling Spatial Attention from Reliability in Vision-Language Models

A study challenges the assumption that visual attention signals reliability in vision-language models. It finds near-zero correlation between spatial attention and accuracy, showing instead that self-consistency across reasoning paths is a stronger predictor of truth. Reliability is better explained by generation dynamics and internal state distributions, not visual attention patterns.
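The self-consistency signal mentioned above can be illustrated with a minimal sketch. It assumes that several reasoning paths have already been sampled for the same question and that each yields a short final answer string; the function name `self_consistency` and the lowercase/strip normalization are illustrative assumptions, not the study's actual procedure.

```python
from collections import Counter


def self_consistency(answers):
    """Return (majority_answer, agreement) for sampled final answers.

    `answers` is a list of final-answer strings, one per sampled
    reasoning path. Agreement is the fraction of paths that voted for
    the majority answer; it is used here as a rough reliability score.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    # Assumption: light normalization so "Left" and "left " count as one answer.
    counts = Counter(a.strip().lower() for a in answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


# Hypothetical final answers from five sampled reasoning paths.
samples = ["Left", "left", "right", "left ", "left"]
answer, agreement = self_consistency(samples)
print(answer, agreement)
```

A low agreement score would flag an answer as unreliable regardless of where the model's spatial attention falls, which is the kind of signal the summary describes.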

arXiv cs.CL · 10d ago

MambaCount: Efficient Text-guided Object Counting

MambaCount introduces a spatial sparse state space duality block to enable efficient text-guided open-vocabulary object counting. It addresses causal modeling limitations and high entropy in spatial token responses, achieving state-of-the-art results on FSC-147 with a test MAE of 12.23 while maintaining linear complexity.

arXiv cs.CL · 10d ago

Vision-language models don't always need images for chest X-ray accuracy

A causal audit shows that text-only models match multimodal models in chest radiography accuracy. Across nine systems, a text-only model performs within 5.7 points of the best multimodal model, and a 119-billion-parameter model is indistinguishable from a 7-billion-parameter text-only baseline. Grounding audits, not accuracy, should determine clinical deployment.

arXiv cs.CL · 11d ago

ContextRL: Context-Aware RL for LLMs

ContextRL introduces an indirect auxiliary objective to improve long-horizon reasoning and multimodal performance in LLMs. It rewards models for selecting the context that supports a query-answer pair, using contrastive context data from coding agent trajectories and image-based visual questions. ContextRL achieves +2.2% and +1.8% gains over standard methods on long-horizon and visual QA benchmarks, with gains attributed to the selection objective, not data augmentation.

arXiv cs.AI · 11d ago

BinTrack: Open-Source Spatial QA with Binary Trajectory Search

BinTrack is a fully open-source spatial question answering agent that uses binary search over a robot's trajectory to locate answers. It achieves up to 22.8% higher accuracy than other open-source methods and matches closed-source model performance on the most challenging global category of the SpaceLocQA benchmark. The system also offers over 1.5x faster inference and introduces GangnamLoop, a real-world outdoor benchmark collected with a quadruped robot.
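The summary does not describe how BinTrack's search is set up, so the following is only a generic sketch of binary search over an ordered trajectory. It assumes a monotone yes/no predicate over frame indices (for example, "has the robot already observed the queried object by this frame?"); `first_true_index` and `seen_by` are hypothetical names, and in practice the predicate would be answered by a model rather than a lambda.

```python
from typing import Callable, Optional


def first_true_index(n_frames: int, seen_by: Callable[[int], bool]) -> Optional[int]:
    """Find the earliest frame index at which `seen_by` becomes True.

    Assumes `seen_by` is monotone along the trajectory: False for every
    frame before some index, True from that index onward. Returns None
    if it is False for all frames.
    """
    lo, hi = 0, n_frames  # the answer lies in [lo, hi]; hi == n_frames means "never"
    while lo < hi:
        mid = (lo + hi) // 2
        if seen_by(mid):
            hi = mid          # first True is at mid or earlier
        else:
            lo = mid + 1      # first True is strictly after mid
    return lo if lo < n_frames else None


# Toy trajectory of 1000 frames where the object first appears at frame 613.
calls = 0

def toy_predicate(i: int) -> bool:
    global calls
    calls += 1
    return i >= 613

found = first_true_index(1000, toy_predicate)
print(found, calls)
```

The appeal of this pattern is that the number of predicate queries grows logarithmically with trajectory length, which is consistent with the faster inference the summary reports, although the actual source of that speedup is not stated here.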

arXiv cs.LG · 10d ago

ASTEROID: Transformer for Multi-Step MD Forecasting

ASTEROID is a data-driven framework that predicts multi-step atomic coordinates in molecular dynamics simulations without iterative integration. It uses a spatiotemporal Transformer architecture to model multiscale dependencies, achieving higher accuracy and reduced computational cost compared to existing methods on quantum-mechanics derived datasets.

arXiv cs.LG · 10d ago

CERS: CoT-Enhanced Reasoning for Medical Image Segmentation

CERS introduces Chain-of-Thought reasoning to improve semi-supervised medical image segmentation by integrating linguistic descriptions from large language models. It uses a semantic-aware reference selection and multi-scale coordinate attention to resolve boundary ambiguities and semantic inconsistencies, outperforming state-of-the-art methods in clinical scenarios with visual-semantic mismatch.

arXiv cs.AI · 10d ago

Quality-Aware Self-Distillation for GUI Grounding

A new method improves GUI grounding by using soft correctness-aware gating and teacher-probability scaling to enhance coordinate-token teacher signals. These components work together to suppress unreliable supervision and calibrate remaining signals, with experiments showing consistent performance gains across six benchmarks.

arXiv cs.CL · 10d ago

MLLP-VRAIN's Simultaneous Speech Translation Submission for IWSLT 2026

The MLLP-VRAIN group submits a cascaded SimulST system using Parakeet and Qwen 3.5 models with adaptive black-box policies. For the En→De, En→It, and En→Zh directions, the system employs ASR word-boosting and RAG with pre-translated exemplars in the new context track, achieving a +5.82 XCOMET-XL improvement on MCIF En→De and an additional +1.03 gain via context integration.

arXiv cs.CL · 10d ago

Soft Prompting for Language Adherence in Multimodal LLMs

A soft prompting approach is proposed to improve language adherence in multimodal LLMs without strict output constraints. The method introduces a new metric to quantify language violations and evaluates three strategies: zero-shot prompting, supervised fine-tuning, and Chain-of-Thought reasoning. Results show effectiveness in reducing language violations while preserving ASR performance across multiple languages, with trade-offs considered under different compute constraints.

arXiv cs.CL · 10d ago

SpeechDx: Multi-Task Benchmark for Clinical Speech AI

SpeechDx introduces a large-scale benchmark with 12 datasets and 27 tasks across diverse health conditions. It evaluates models by speech production stages and reveals that large-scale models perform best, while domain-specific models show limited generalization across clinical conditions.

arXiv cs.CL · 10d ago

Operationalizing Ontology for Untranslatability in NLP

A new ontology and taxonomy of compensation strategies for untranslatable cases are introduced, enabling controlled analysis of machine translation. A multilingual dataset pairs untranslatable sentences with strategy-based translations, showing human preference for outputs that include explanatory context, known as the Annotation compensation strategy.

arXiv cs.CL · 10d ago

LLMs Outperform Humans in Next Speaker Prediction

Large language models outperformed humans and supervised models in next speaker prediction using the AMI corpus, despite lacking audio-visual data and domain training. Multimodal LLMs surpassed text-based LLMs in addressee and turn-change detection but still fell short of human performance, highlighting challenges in utilizing raw audio-visual signals. Ablation studies show conversational context is crucial, especially for next speaker prediction, with both humans and LLMs struggling during frequent turn changes.

arXiv cs.CL · 10d ago

The Slop Paradox: AI Rewriting Degrades Clinical Uncertainty and Cross-Modal Alignment

AI-rewritten radiology reports show significant information loss, with EHR summarization eroding 51.4% of clinical entities and 43.7% of hedging language. Although EHR summarization largely preserves image-text alignment, the standardized and teaching-case rewriting tasks reduce cross-modal alignment by 14.9-16.5%, six to seven times more than EHR summarization does. The study finds no preferential degradation of rare pathologies and identifies the type of rewriting task, not the clinical content, as the key driver of degradation.

arXiv cs.CL · 10d ago

ChLogic: Testing Logical Reasoning Robustness in Chinese Expressions

ChLogic evaluates how well large language models maintain logical reasoning when English logical structures are expressed in Chinese. It reveals a persistent English-Chinese performance gap, with back-translation improving results on general items but harming performance on difficult problems. The benchmark highlights the impact of surface realization, translation artifacts, and model-specific behaviors on multilingual reasoning.

arXiv cs.AI · 11d ago

CrossMaps: Confidence-Aware Semantic Mapping for Rover Navigation

CrossMaps is a real-time, confidence-aware semantic mapping pipeline that uses RGB-D data to create language-queryable maps. It integrates multi-scale CLIP embeddings with a dual-memory architecture (Short-Term and Long-Term Memory) to aggregate visual observations and promote coherent, confident cells as persistent semantic landmarks. The system enables natural language queries to guide rover navigation via semantic heatmaps.

arXiv cs.AI · 11d ago

FusionRS: First Large-Scale RGB-Infrared Remote Sensing Dataset

FusionRS introduces the first large-scale RGB-infrared-text dataset for remote sensing vision-language modeling. It aligns RGB and infrared images with IR-aware captions, enabling dual-modal vision-language foundation models. Experiments show improved RGB-IR alignment, retrieval, and captioning, with ablation studies confirming the critical role of modality-specific textual supervision.