Multimodal — korshunov.ai

Topic · Multimodal

ViGOS: Decoupling Perception and Reasoning in Multimodal On-Policy Self-Distillation

ViGOS introduces a visually grounded on-policy self-distillation framework for multimodal large language models. It decouples perception and reasoning by using an image-only teacher for visual descriptions and a reasoning teacher for final outputs, reducing reliance on text-only references. This approach improves image-grounded performance across multiple vision-language benchmarks.

arXiv cs.AI · 10d ago

RTSGameBench: An RTS Benchmark for Strategic Reasoning

RTSGameBench addresses limitations in existing RTS benchmarks by offering diverse gameplay, targeted competency diagnosis, and self-evolving scenario generation. It evaluates vision-language models in strategic reasoning under uncertainty, revealing that state-of-the-art models struggle with multiagent coordination and large-scale tasks.

arXiv cs.AI · 10d ago

ThinkDeception: Interpretable Multimodal Deception Detection Framework

ThinkDeception introduces a progressive reinforcement learning framework that enables interpretable multimodal deception detection. It leverages a step-by-step annotated Chain of Thought dataset and proposes Visual-Audio Consistency Group Relative Policy Optimization with a dynamic curriculum, enhancing reasoning quality and outperforming existing methods on mainstream benchmarks.

arXiv cs.LG · 10d ago

Vision-language models don't always need images for chest X-ray accuracy

A causal audit shows that many vision-language models achieve high chest radiograph accuracy without using images. Text-only models match multimodal models in performance and outperform them in grounding, with accuracy and confidence flags only appearing when image use occurs. These findings suggest that accuracy alone is insufficient to validate clinical deployment, and grounding must be assessed.

arXiv cs.AI · 11d ago

WEQA: Wearable Health Question Answering with Query-Adaptive Agentic Reasoning

WEQA introduces a query-adaptive agent framework that combines language models with specialized wearable data analysis tools. It outperforms LLM and agentic baselines by 24% in accuracy and demonstrates improved usefulness and clinical soundness in expert and user evaluations.

arXiv cs.AI · 11d ago

LEADS: Agentic Discovery of Hybrid Models for Cardiac Electrophysiology

LEADS proposes a framework that uses an LLM agent to discover hybrid cardiac electrophysiology models through an iterative reasoning-and-action loop. It formulates domain knowledge as a structured action space, enabling physically grounded, interpretable, and numerically stable model designs, outperforming both human-designed and other LLM-based approaches on synthetic and real cardiac data.

arXiv cs.CL · 11d ago

Visuals Lie, Consistency Speaks: Disentangling Spatial Attention from Reliability in Vision-Language Models

A study challenges the assumption that visual attention signals reliability in vision-language models. It finds near-zero correlation between spatial attention and accuracy, showing instead that self-consistency across reasoning paths is a stronger predictor of truth. Reliability is better explained by generation dynamics and internal state distributions, not visual attention patterns.

arXiv cs.CL · 11d ago

MambaCount: Efficient Text-guided Object Counting

MambaCount introduces a spatial sparse state space duality block to enable efficient text-guided open-vocabulary object counting. It addresses causal modeling limitations and high entropy in spatial token responses, achieving state-of-the-art results on FSC-147 with a test MAE of 12.23 while maintaining linear complexity.

arXiv cs.CL · 11d ago

Vision-language models don't always need images for chest X-ray accuracy

A causal audit shows that text-only models match multimodal models in chest radiography accuracy. Across nine systems, a text-only model performs within 5.7 points of the best multimodal model, and a 119-billion-parameter model is indistinguishable from a 7-billion-parameter text-only baseline. Grounding audits, not accuracy, should determine clinical deployment.

arXiv cs.CL · 12d ago

ContextRL: Context-Aware RL for LLMs

ContextRL introduces an indirect auxiliary objective to improve long-horizon reasoning and multimodal performance in LLMs. It rewards models for selecting the context that supports a query-answer pair, using contrastive context data from coding agent trajectories and image-based visual questions. ContextRL achieves +2.2% and +1.8% gains over standard methods on long-horizon and visual QA benchmarks, with gains attributed to the selection objective, not data augmentation.

arXiv cs.AI · 12d ago

BinTrack: Open-Source Spatial QA with Binary Trajectory Search

BinTrack is a fully open-source spatial question answering agent that uses binary search over a robot's trajectory to locate answers. It achieves up to 22.8% higher accuracy than other open-source methods and matches closed-source model performance on the most challenging global category of the SpaceLocQA benchmark. The system also offers over 1.5x faster inference and introduces GangnamLoop, a real-world outdoor benchmark collected with a quadruped robot.

arXiv cs.LG · 9d ago

Semantic Robustness Certification for Vision-Language Models

This work introduces a framework that certifies vision-language model robustness under semantic-level transformations, using text prompts as proxies. It quantifies extent intervals for which predictions remain unchanged, without requiring additional data for each variation. Experiments on synthetic and real-world data demonstrate its effectiveness across diverse semantic variations.

arXiv cs.LG · 9d ago

Latent SDEs for Anomaly Detection in Sparse Multivariate Time Series

This work proposes a generative method that uses Latent SDEs to detect anomalies in sparse, irregularly sampled multivariate time series. The approach projects observed data onto continuous-time stochastic systems, handling missing values and irregular sampling while capturing cyclic patterns. Experiments on six benchmark datasets show that the method outperforms state-of-the-art baselines, especially under severe data sparsity.

arXiv cs.LG · 9d ago

ChronoSurv: A Graph Framework for Multimodal Survival Analysis

ChronoSurv introduces a hierarchical directed graph framework that models patient care as a progression-aware clinical trajectory. It achieves state-of-the-art performance in multimodal survival prediction by capturing structured clinical workflows and handling missing data through heterogeneous message passing.

arXiv cs.CL · 10d ago

Fair Cognitive Impairment Detection Through Unlearning

A multimodal framework combines speech, text, and image data with gradient reversal unlearning to reduce demographic bias in Mild Cognitive Impairment detection. The method outperforms existing multilingual and multimodal baselines on TAUKADIAL and PREPARE, with reduced performance gaps across sex and language subgroups, and shows improved transfer across datasets.

arXiv cs.CL · 10d ago

Morpheus: Neural Tokenizer and Embedder for Turkish

Morpheus is a morphology-aware neural tokenizer and word embedder for Turkish that preserves original text through lossless encoding and decoding. It achieves the lowest bits-per-character (1.425), improves morphological alignment (MorphScore macro-F1 0.61), and uses 19% less GPU memory than 64K-vocabulary subword tokenizers. Frozen Morpheus embeddings outperform BGE-M3 and BERTurk in lexical retrieval, with root-family MAP of 0.85 and ROC-AUC of 1.00.

arXiv cs.CL · 10d ago

SAMA: Unified Framework for Low-Resource Multimodal Data Augmentation

SAMA introduces a unified framework that generates high-fidelity, task-aware synthetic data by aligning semantic anchors across modalities. It uses a Collaborative Multi-Experts Multimodal Large Language Model with shared and task-specific adapters, and employs an Anchor-Preserving Diffusion mechanism for image synthesis, ensuring semantic consistency while diversifying visual contexts. Extensive experiments show SAMA outperforms state-of-the-art methods in MNER, MRE, and MEE under low-resource conditions.

arXiv cs.CL · 10d ago

RPCL Improves Multimodal Emotion-Cause Pair Extraction

RPCL, a training-only framework, enhances pair confidence in multimodal emotion-cause pair extraction by enforcing discriminative and stable confidence margins. It outperforms a base model on ECF, MECAD, and MEC4 by 2.58 to 2.83 percentage points in Pair F1 and improves mean Pair AUPRC across datasets, with stronger separation between gold pairs and hard negatives.

arXiv cs.CL · 10d ago

Steerable Model Merging for Multilingual Reasoning

Steerable Model Merging (ST-Merge) introduces a gated cross-attention mechanism to adaptively weight source models during multilingual reasoning. It outperforms existing baselines on four multilingual reasoning benchmarks across 21 languages by dynamically prioritizing models based on input characteristics.

arXiv cs.CL · 10d ago

IndicContextEval: Benchmark for Context Utilisation in Audio LLMs

IndicContextEval introduces a 56-hour multilingual benchmark featuring natural speech from 555 speakers across 8 Indian languages and 23 domains. It employs a 7-level prompting framework to progressively test context utilisation, including metadata, descriptions, and adversarial inputs. Evaluation of five models shows significant differences in contextual grounding, underscoring the need for explicit assessment of context use in AudioLLMs.
