Multimodal
arxiv arXiv cs.AI · 1d ago

DataClaw0: Agentic Tailoring of Multimodal Data from Raw Streams

DataClaw0 introduces an agentic paradigm for actively refining multimodal data to align with user and downstream intents. It uses a two-stage pipeline with factual anchors to generate a large-scale dataset across five domains and achieves strong alignment via supervised fine-tuning and GRPO. Evaluated on video generation, VQA, and GUI navigation, DataClaw0 produces high-information-density data, enabling efficient model adaptation with minimal training data.

arxiv arXiv cs.AI · 1d ago

EnTrust: Modeling Inter-Modal Conflict for Trustworthy Multimodal Medical Image Analysis

EnTrust introduces a framework that treats inter-modal conflict as the primary source of predictive uncertainty in medical image analysis. It decomposes multimodal features into shared consensus, modality-specific cues, and conflict signals, enabling calibrated, pixel-wise uncertainty estimation through a diffusion-based model and trust mapping. EnTrust achieves state-of-the-art segmentation accuracy, reduces calibration error by 40%, and outperforms 5-member deep ensembles while using half the memory.

lab Mistral AI News · 2d ago

Mistral Releases OCR 4 with Multilingual Support and Structured Output

Mistral OCR 4 introduces bounding boxes, block classification, and inline confidence scores for 170 languages across 10 language groups. It outperforms leading OCR systems in human preference evaluations with a 72% win rate and achieves the top score on OlmOCRBench (85.20), while offering self-hosted deployment in a single container and supporting enterprise use cases like RAG and document ingestion.

arxiv arXiv cs.CL · 2d ago

PIVOTSBench: Benchmark for Fine-Grained Interpersonal Reasoning in MLLMs

PIVOTSBench is the first benchmark that evaluates multimodal large language models' ability to reason about bidirectional interpersonal relationships using Social-IQ 2.0 and YouTube data. It includes auxiliary tasks to assess visual cue identification and conducts ablation studies on visual modalities and social role information, analyzing how joint and pairwise predictions improve performance on relationship dimensions grounded in psychology research.

arxiv arXiv cs.CL · 2d ago

AI-Constructed Brand Reputation Is Language-Bound

AI-generated brand reputations vary significantly by language, with Uralic and Baltic languages showing more positive sentiment and Germanic languages, including English, being more critical. Query language impacts which brands are recommended, especially for local champions, where home-language queries increase visibility by 0.80 points compared to English queries. English-only monitoring fails to capture the full AI visibility of locally headquartered brands, creating a measurable language blind spot.

arxiv arXiv cs.CL · 2d ago

VeriEvol: Scaling Multimodal Mathematical Reasoning with Verifiable Evolution

VeriEvol introduces a verifiable data-construction framework for visual mathematical reasoning, decoupling prompt difficulty and answer reliability. It evolves image-question prompts using type-aware operators and verifies answers via multi-source counter-evidence falsification. On five benchmarks, scaling from 10K to 250K samples improves mean accuracy from 35.42 to 54.73, with a cumulative +3.88 over baseline, driven by evolved prompts and HTV-Agent verification.

arxiv arXiv cs.CL · 2d ago

CapRiCorn-1K: Benchmark for Video Captioning and Subject Consistency

CapRiCorn-1K is a benchmark that evaluates video captioning quality and subject referential consistency across different video durations and domains. It supports both audiovisual and visual-only settings. Results show that current models struggle to keep subject references consistent, and that both caption quality and consistency decline as video length increases. The benchmark's metrics align strongly with downstream tasks, which supports their validity.

arxiv arXiv cs.CL · 2d ago

Moshi-Face: Full-Duplex Dialogue with Facial Generation

Moshi-Face is the first full-duplex spoken dialogue model that jointly processes audio and facial input, generating both speech and synchronized facial motion. It uses a VQ-VAE face codec to encode 3D head meshes derived from facial videos into discrete face tokens and to reconstruct the meshes from those tokens, and a Face Transformer module that generates these tokens non-autoregressively for real-time audiovisual output. Experiments show Moshi-Face achieves audiovisual alignment with low latency while maintaining the original dialogue quality.

arxiv arXiv cs.CL · 2d ago

Speech-Text Models Latently Transcribe Speech in Intermediate Layers

Interleaved speech-language models undergo an implicit transcription phase in which spoken words become decodable as text tokens in intermediate layers, even though the models receive no speech recognition training. In up to 77% of the data, the spoken word appears as a top candidate text prediction, followed by a text continuation and then a return to speech. This behavior is driven by interleaved training data and text LM initialization, and it correlates with spoken knowledge performance.

arxiv arXiv cs.CL · 2d ago

ROMEVA: Geometry-Preserving Vocabulary Expansion for Roman Urdu Language Models

ROMEVA addresses sub-word fragmentation in Roman Urdu by combining sub-word-average initialization with a PCA-guided anchor loss to stabilize embeddings. While ROMEVA best preserves the pretrained embeddings, naive fine-tuning achieves better sentiment classification performance, indicating a trade-off between embedding stability and downstream performance in morphologically inconsistent languages.
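
For readers unfamiliar with the term, sub-word-average initialization is commonly understood as setting a new token's embedding to the mean of the embeddings of the sub-word pieces it previously split into. The sketch below illustrates that general idea only; it is not ROMEVA's code, the PCA-guided anchor loss is not shown, and the function name, the toy tokenizer, the example word, and the 3-dimensional vectors are all hypothetical.

```python
from statistics import fmean


def subword_average_init(new_token, tokenize, embeddings):
    """Return an initial vector for new_token: the per-dimension mean of
    the embeddings of the sub-word pieces it currently splits into."""
    pieces = tokenize(new_token)
    if not pieces:
        raise ValueError(f"no sub-word pieces found for {new_token!r}")
    vectors = [embeddings[piece] for piece in pieces]
    # zip(*vectors) groups the values of each dimension across all pieces.
    return [fmean(dim_values) for dim_values in zip(*vectors)]


# Toy data (made up for illustration, not from the paper).
toy_embeddings = {
    "kit": [1.0, 0.0, 2.0],
    "aab": [3.0, 2.0, 0.0],
}


def toy_tokenize(word):
    # Hypothetical fixed segmentation standing in for a real tokenizer.
    return {"kitaab": ["kit", "aab"]}.get(word, [])


new_vector = subword_average_init("kitaab", toy_tokenize, toy_embeddings)
print(new_vector)  # [2.0, 1.0, 1.0]
```

Starting from the average keeps the new token close to where the model already represented the word, which is the kind of embedding stability the summary contrasts with naive fine-tuning.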