Multimodal — korshunov.ai

Source Language Effects in Cross-Lingual In-Context Learning

A study finds that fine-tuning-based assumptions about cross-lingual transfer do not apply to in-context learning (ICL). The research reveals that source language selection in ICL requires new heuristics, especially for generative tasks where language confusion is a key challenge.

arXiv cs.LG · 8d ago

ST-CND Framework for Early Warning of Geographic Tipping Points

SpatioTemporal Causal Network Diagnostics (ST-CND) is a data-driven framework that detects geographic tipping points by modeling spatial fields as time-evolving causal networks. It outperforms existing methods on sea-surface temperature benchmarks, achieving an AUROC of 0.783 and a critical-subnetwork IoU of 0.378 for the North Atlantic AMOC.

arXiv cs.LG · 8d ago

Physics-Constrained Neural Networks Improve Weather Forecasting

A study enhances physics-constrained neural networks by introducing an upgraded numerical solver, a unified autoregressive block, and two neural backbones. These improvements reduce root mean squared error by 8-22% in short-term forecasts over the South Pacific and better preserve physical consistency.

arXiv cs.LG · 8d ago

ASTEROID: Transformer for Multi-Step MD Forecasting

ASTEROID is a data-driven framework that predicts multi-step atomic coordinates in molecular dynamics simulations without iterative integration. It uses a spatiotemporal Transformer architecture to model multiscale dependencies, achieving higher accuracy and reduced computational cost compared to existing methods on quantum-mechanics derived datasets.

arXiv cs.LG · 8d ago

Vision-language models don't always need images for chest X-ray accuracy

A causal audit shows that many vision-language models achieve high chest radiograph accuracy without using images. Text-only models match multimodal models in performance and outperform them in grounding, with accuracy and confidence flags only appearing when image use occurs. These findings suggest that accuracy alone is insufficient to validate clinical deployment, and grounding must be assessed.

arXiv cs.LG · 8d ago

Order-Independent Cell-Level Representations for Multi-Task Table Recognition

This paper introduces a structural refinement module using non-causal attention to generate order-independent cell features in autoregressive multi-task table recognition. The approach enables parallel cell content inference while maintaining global context, improving cell localization and end-to-end recognition with a threefold reduction in inference time.

arXiv cs.LG · 8d ago

CERS: CoT-Enhanced Reasoning for Medical Image Segmentation

CERS introduces Chain-of-Thought reasoning to improve semi-supervised medical image segmentation by integrating linguistic descriptions from large language models. It uses a semantic-aware reference selection and multi-scale coordinate attention to resolve boundary ambiguities and semantic inconsistencies, outperforming state-of-the-art methods in clinical scenarios with visual-semantic mismatch.

arXiv cs.AI · 8d ago

Semantics-First Latent Modeling for 3D MRI Reconstruction

A new framework prioritizes anatomical semantics during 3D MRI latent compression, addressing long-range coherence and clinical detail loss. It introduces a Latent Harmonization Encoder and Semantic Recovery Block to preserve meaningful structures, and an Anatomy-aware Frequency Loss to maintain high-frequency diagnostic features. Experiments on public MRI datasets show improved reconstruction and cross-contrast synthesis quality.

arXiv cs.AI · 8d ago

Source Language Effects in Cross-Lingual In-Context Learning

A study finds that fine-tuning-based assumptions about cross-lingual transfer do not apply in few-shot in-context learning (ICL). The research reveals that source language selection significantly impacts performance and identifies new heuristics for effective cross-lingual ICL.

arXiv cs.AI · 8d ago

Quality-Aware Self-Distillation for GUI Grounding

A new method improves GUI grounding by using soft correctness-aware gating and teacher-probability scaling to enhance coordinate-token teacher signals. These components work together to suppress unreliable supervision and calibrate remaining signals, with experiments showing consistent performance gains across six benchmarks.

arXiv cs.AI · 8d ago

WEQA: Wearable Health Question Answering with Query-Adaptive Agentic Reasoning

WEQA introduces a query-adaptive agent framework that combines language models with specialized wearable data analysis tools. It outperforms LLM and agentic baselines by 24% in accuracy and demonstrates improved usefulness and clinical soundness in expert and user evaluations.

arXiv cs.AI · 8d ago

LEADS: Agentic Discovery of Hybrid Models for Cardiac Electrophysiology

LEADS proposes a framework that uses an LLM agent to discover hybrid cardiac electrophysiology models through an iterative reasoning-and-action loop. It formulates domain knowledge as a structured action space, enabling physically grounded, interpretable, and numerically stable model designs, outperforming both human-designed and other LLM-based approaches on synthetic and real cardiac data.

arXiv cs.CL · 8d ago

MLLP-VRAIN's Simultaneous Speech Translation Submission for IWSLT 2026

The MLLP-VRAIN group submits a cascaded SimulST system using Parakeet and Qwen 3.5 models with adaptive black-box policies. For the En→De, En→It, and En→Zh directions, it employs ASR word-boosting and RAG with pre-translated exemplars in the new context track, achieving a +5.82 XCOMET-XL improvement on MCIF En→De and an additional +1.03 gain via context integration.

arXiv cs.CL · 8d ago

Soft Prompting for Language Adherence in Multimodal LLMs

A soft prompting approach is proposed to improve language adherence in multimodal LLMs without strict output constraints. The method introduces a new metric to quantify language violations and evaluates three strategies: zero-shot prompting, supervised fine-tuning, and Chain-of-Thought reasoning. Results show effectiveness in reducing language violations while preserving ASR performance across multiple languages, with trade-offs considered under different compute constraints.

arXiv cs.CL · 8d ago

SpeechDx: Multi-Task Benchmark for Clinical Speech AI

SpeechDx introduces a large-scale benchmark with 12 datasets and 27 tasks across diverse health conditions. It evaluates models by speech production stages and reveals that large-scale models perform best, while domain-specific models show limited generalization across clinical conditions.

arXiv cs.CL · 8d ago

Operationalizing Ontology for Untranslatability in NLP

A new ontology and taxonomy of compensation strategies for untranslatable cases are introduced, enabling controlled analysis of machine translation. A multilingual dataset pairs untranslatable sentences with strategy-based translations, showing human preference for outputs that include explanatory context, known as the Annotation compensation strategy.

arXiv cs.CL · 8d ago

Visuals Lie, Consistency Speaks: Disentangling Spatial Attention from Reliability in Vision-Language Models

A study challenges the assumption that visual attention signals reliability in vision-language models. It finds near-zero correlation between spatial attention and accuracy, showing instead that self-consistency across reasoning paths is a stronger predictor of truth. Reliability is better explained by generation dynamics and internal state distributions, not visual attention patterns.

arXiv cs.CL · 8d ago

LLMs Outperform Humans in Next Speaker Prediction

Large language models outperformed humans and supervised models in next speaker prediction using the AMI corpus, despite lacking audio-visual data and domain training. Multimodal LLMs surpassed text-based LLMs in addressee and turn-change detection but still fell short of human performance, highlighting challenges in utilizing raw audio-visual signals. Ablation studies show conversational context is crucial, especially for next speaker prediction, with both humans and LLMs struggling during frequent turn changes.

arXiv cs.CL · 8d ago

MambaCount: Efficient Text-guided Object Counting

MambaCount introduces a spatial sparse state space duality block to enable efficient text-guided open-vocabulary object counting. It addresses causal modeling limitations and high entropy in spatial token responses, achieving state-of-the-art results on FSC-147 with a test MAE of 12.23 while maintaining linear complexity.

arXiv cs.CL · 8d ago

Vision-language models don't always need images for chest X-ray accuracy

A causal audit shows that text-only models match multimodal models in chest radiography accuracy. Across nine systems, a text-only model performs within 5.7 points of the best multimodal model, and a 119-billion-parameter model is indistinguishable from a 7-billion-parameter text-only baseline. Grounding audits, not accuracy, should determine clinical deployment.