Multimodal — korshunov.ai

DataClaw0: Agentic Tailoring of Multimodal Data from Raw Streams

DataClaw0 introduces an agentic paradigm for actively refining multimodal data to align with user and downstream intents. It uses a two-stage pipeline with factual anchors to generate a large-scale dataset across five domains and achieves strong alignment via supervised fine-tuning and GRPO. Evaluated on video generation, VQA, and GUI navigation, DataClaw0 produces high-information-density data, enabling efficient model adaptation with minimal training data.
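
The summary does not say how DataClaw0 configures GRPO. As a rough illustration of the group-relative idea behind GRPO in general, the sketch below scores each sampled response against the mean and standard deviation of its own sampling group. The function name, the `eps` constant, and the choice of population standard deviation are assumptions made for this example, not details from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Return one advantage per reward, normalized within the group.

    `rewards` holds the scalar rewards of several responses sampled for
    the same prompt. `eps` (an assumed constant) avoids division by zero
    when every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Calling `group_relative_advantages([1.0, 0.0, 0.0, 1.0])` gives positive advantages to the two rewarded responses and negative ones to the others, so no separate value network is needed to form a baseline.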

arxiv arXiv cs.AI · 1d ago

EnTrust: Modeling Inter-Modal Conflict for Trustworthy Multimodal Medical Image Analysis

EnTrust introduces a framework that treats inter-modal conflict as the primary source of predictive uncertainty in medical image analysis. It decomposes multimodal features into shared consensus, modality-specific cues, and conflict signals, enabling calibrated, pixel-wise uncertainty estimation through a diffusion-based model and trust mapping. EnTrust achieves state-of-the-art segmentation accuracy, reduces calibration error by 40%, and outperforms 5x deep ensembles with half the memory footprint.

arxiv arXiv cs.AI · 1d ago

MIRCaps: Large-Scale Mixed-Domain Vision-Language Dataset

MIRCaps introduces a large-scale multimodal dataset with 141,364 images, 981,947 image-level captions, 1,742,264 region-level captions, and 5,391,779 bounding box annotations. It enables fine-grained vision-language learning by providing detailed captions for object categories, sizes, colors, actions, and environmental context, and demonstrates effectiveness in image captioning and object detection tasks.

arxiv arXiv cs.AI · 1d ago

Explainable AI Model for Career-Related Depression in University Students

A new Explainable AI framework uses structured behavioral data and facial emotion features to detect early signs of career-related depression and anxiety in university students. The model, evaluated on Pakistani student data, achieves an F1-score of 89.12% and identifies key markers like avoidance of direct gaze and social withdrawal, aligning with psychological theory.

arxiv arXiv cs.AI · 1d ago

Decoupling Declarative and Procedural Knowledge in Vision-Language-Action Models

w²VLA introduces a modular vision-language-action model that decouples declarative and procedural knowledge. By restructuring information flow, it enables robust behavior cloning and zero-shot skill transfer to novel, dissimilar objects.

lab Mistral AI News · 2d ago

Mistral Releases OCR 4 with Multilingual Support and Structured Output

Mistral OCR 4 introduces bounding boxes, block classification, and inline confidence scores for 170 languages across 10 language groups. It outperforms leading OCR systems in human preference evaluations with a 72% win rate and achieves the top score on OlmOCRBench (85.20), while offering self-hosted deployment in a single container and supporting enterprise use cases like RAG and document ingestion.

arxiv arXiv cs.CL · 2d ago

Comparative Evaluation of MT Systems and Post-Editor Groups in Specialised Translation

The study compares three MT systems—DeepL, eTranslation, and Systran—and two post-editor groups: linguists/translators and NLP experts. Results show significant differences in terminological accuracy and fluency, emphasizing the role of domain knowledge in specialised translation and the variable performance of MT systems in language-specific contexts.

arxiv arXiv cs.CL · 2d ago

PIVOTSBench: Benchmark for Fine-Grained Interpersonal Reasoning in MLLMs

PIVOTSBench is the first benchmark that evaluates multimodal large language models' ability to reason about bidirectional interpersonal relationships using Social-IQ 2.0 and YouTube data. It includes auxiliary tasks to assess visual cue identification and conducts ablation studies on visual modalities and social role information, analyzing how joint and pairwise predictions improve performance on relationship dimensions grounded in psychology research.

arxiv arXiv cs.CL · 2d ago

AI-Constructed Brand Reputation Is Language-Bound

AI-generated brand reputations vary significantly by language, with Uralic and Baltic languages showing more positive sentiment and Germanic languages, including English, being more critical. Query language impacts which brands are recommended, especially for local champions, where home-language queries increase visibility by 0.80 points compared to English queries. English-only monitoring fails to capture the full AI visibility of locally headquartered brands, creating a measurable language blind spot.

arxiv arXiv cs.CL · 2d ago

CFPO: Counterfactual Policy Optimization for Multimodal Reasoning

CFPO introduces a cross-modal counterfactual enhancement mechanism to improve causal consistency between visual perception and textual reasoning in vision-language models. It achieves 3.17%-6.25% gains over standard RL baselines and 1.32%-2.13% over PAPO, without requiring external rewards or supervision.

arxiv arXiv cs.CL · 2d ago

VeriEvol: Scaling Multimodal Mathematical Reasoning with Verifiable Evolution

VeriEvol introduces a verifiable data-construction framework for visual mathematical reasoning, decoupling prompt difficulty and answer reliability. It evolves image-question prompts using type-aware operators and verifies answers via multi-source counter-evidence falsification. On five benchmarks, scaling from 10K to 250K samples improves mean accuracy from 35.42 to 54.73, with a cumulative +3.88 over baseline, driven by evolved prompts and HTV-Agent verification.

arxiv arXiv cs.CL · 2d ago

CapRiCorn-1K: Benchmark for Video Captioning and Subject Consistency

CapRiCorn-1K is a benchmark that evaluates video captioning quality and subject referential consistency across different video durations and domains. It supports both audiovisual and visual-only settings, revealing that current models struggle to maintain consistent subject references, especially in longer videos, with caption quality and consistency declining as video length increases. The benchmark's metrics show strong alignment with downstream tasks, validating their effectiveness.

arxiv arXiv cs.CL · 2d ago

ViRGo: Adaptive Routing for Visual Retrieval and Global Perception

ViRGo introduces a lightweight framework that adapts visual retrieval based on object scale. It uses intrinsic localization and semantic confidence to route between global perception, patch-based retrieval, and attention-based retrieval, improving accuracy-efficiency trade-offs without extra computation.

arxiv arXiv cs.CL · 2d ago

Moshi-Face: Full-Duplex Dialogue with Facial Generation

Moshi-Face is the first full-duplex spoken dialogue model that jointly processes audio and facial input, generating both speech and synchronized facial motion. It uses a VQ-VAE face codec to encode 3D head meshes from facial videos into discrete face tokens and to reconstruct the meshes from those tokens, and a Face Transformer module to generate the tokens non-autoregressively for real-time audiovisual output. Experiments show that Moshi-Face achieves audiovisual alignment with low latency while maintaining the original dialogue quality.
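
The discrete face tokens come from a VQ-VAE codec. The summary gives no codec details, so the following is only a generic sketch of the vector-quantization step used in VQ-VAE-style models: each continuous feature vector is replaced by the index of its nearest codebook entry. The function names, the toy codebook, and the squared Euclidean distance are assumptions for illustration, not the Moshi-Face implementation.

```python
def quantize(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry."""
    tokens = []
    for v in vectors:
        # Squared Euclidean distance from v to every codebook entry.
        dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens


def dequantize(tokens, codebook):
    """Look up the codebook vector for each discrete token."""
    return [codebook[t] for t in tokens]
```

In a full codec an encoder would produce `vectors` from the mesh and a decoder would turn the output of `dequantize` back into a mesh. Only the lookup in between is shown here.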

arxiv arXiv cs.CL · 2d ago

TSCognition and TSAlign Advance Time Series Reasoning with LLMs

TSCognition introduces a multimodal benchmark with 41K QA samples across five cognitive reasoning tasks. TSAlign outperforms existing models on TSCognition and TimerBed while reducing computational cost, using patch-level representations and alignment in LLM embedding space.

arxiv arXiv cs.CL · 2d ago

BioMatrix: First Natively Multimodal Biological Foundation Model

BioMatrix integrates sequences, structures, and language for molecules and proteins in a single decoder-only architecture. It achieves state-of-the-art or competitive performance on 77 out of 80 downstream tasks, demonstrating effective multimodal generalist capabilities without external components.

arxiv arXiv cs.CL · 2d ago

Speech-Text Models Latently Transcribe Speech in Intermediate Layers

Interleaved speech-language models undergo an implicit transcription phase in which spoken words become decodable as text tokens in intermediate layers, despite no speech recognition training. For up to 77% of the data, the spoken word appears as a top candidate text prediction, followed by text continuation and a return to speech. This behavior is driven by interleaving data and text LM initialization, and it correlates with spoken knowledge performance.

arxiv arXiv cs.CL · 2d ago

ROMEVA: Geometry-Preserving Vocabulary Expansion for Roman Urdu Language Models

ROMEVA addresses sub-word fragmentation in Roman Urdu by combining sub-word-average initialization and PCA-guided anchor loss to stabilize embeddings. While ROMEVA best preserves pretrained embeddings, naive fine-tuning achieves superior sentiment classification performance, indicating a trade-off between embedding stability and downstream performance in morphologically inconsistent languages.
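
Sub-word-average initialization is commonly understood as setting a new token's embedding to the mean of the embeddings of the pieces the original tokenizer would split it into. The sketch below shows that idea with plain Python lists. The `tokenize` callable, the `embeddings` dictionary, and the function name are hypothetical stand-ins, and the PCA-guided anchor loss mentioned in the summary is not covered.

```python
def subword_average_init(new_token, tokenize, embeddings):
    """Initialize an embedding for `new_token` from its old sub-word pieces.

    `tokenize` is the pretrained tokenizer's split function (hypothetical
    here) and `embeddings` maps each existing piece to its vector.
    """
    pieces = tokenize(new_token)
    if not pieces:
        raise ValueError("tokenizer returned no pieces for the new token")
    vecs = [embeddings[p] for p in pieces]
    dim = len(vecs[0])
    # Element-wise mean over the sub-word vectors.
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
```

Starting a new token at the mean of its pieces keeps it close to where the pretrained model already placed the fragmented form, which is the kind of embedding stability the summary contrasts with naive fine-tuning.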

arxiv arXiv cs.CL · 2d ago

Gazer: Training-Free Semantic Correction for Autoregressive Visual Models

Gazer introduces a training-free framework that uses multimodal large language model feedback to correct semantic errors in real time during autoregressive visual model generation. By integrating reflective diagnosis and semantic correction stages, Gazer improves compositional accuracy and semantic alignment across multiple models without additional training.

arxiv arXiv cs.CL · 2d ago

Multimodal Chain-of-Thought: Capabilities and Limitations

Multimodal Chain-of-Thought reasoning improves performance in mathematical and scientific reasoning but harms visual grounding and object counting in perception tasks. Models exhibit a 'Look Light, Think Heavy' pattern, where visual reflection diminishes while verbal reflection increases, indicating a persistent bottleneck in visual reasoning.