Multimodal — korshunov.ai

MMGist: A Comprehensive Multimodal Benchmark for 2027

MMGist is a curated multimodal benchmark with 7,262 items, designed to address flaws in existing vision-language benchmarks. It reduces evaluation size by 69% and improves cross-model discrimination by 78%, while preserving model rankings with a Spearman correlation of 0.98. The benchmark highlights visual logic as a key weakness and emphasizes the importance of visual dependency, discriminative power, and reliability in evaluation.
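
For readers unfamiliar with the ranking-preservation metric mentioned above, here is a minimal sketch of Spearman rank correlation between model scores on a full benchmark and on a reduced subset. The score lists are hypothetical placeholders, not numbers from the paper, and the helper names `average_ranks` and `spearman` are made up for illustration.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical accuracies for five models (assumed values, not from the paper).
full_scores = [71.2, 64.5, 58.9, 80.1, 49.3]
subset_scores = [66.0, 61.8, 52.4, 77.5, 40.2]
rho = spearman(full_scores, subset_scores)
print(f"Spearman correlation: {rho:.2f}")
```

A value near 1 means the smaller benchmark orders the models almost exactly as the larger one does, even if the absolute scores differ.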

arxiv arXiv cs.AI · 11h ago

Efficient Multimodal Models for Pulmonary Embolism Risk Assessment

A benchmark using efficient multimodal large language models evaluates PE diagnosis and risk prediction on the INSPECT dataset. Results show Gemma4 E4B and E2B outperform others when EHR data is available, with PE diagnosis achieving higher accuracy than prognostic tasks like readmission prediction.

arxiv arXiv cs.AI · 14h ago

Deep Learning Pipeline for Sign Language Recognition and Translation to Indian Vernaculars

A two-stage deep learning pipeline classifies Indian sign language video clips into English words using a fine-tuned VideoMAE model and translates them into Hindi, Telugu, and Bengali via the NLLB-200 multilingual model. The system achieves 99% training and 78% validation accuracy on a 13-class, 197-clip dataset with uniform 16-frame clips at 224×224 resolution, and includes a Streamlit demo for user-uploaded videos with per-class analysis and failure mode identification.

arxiv arXiv cs.AI · 14h ago

Gazer: Training-Free Semantic Correction for Autoregressive Visual Models

Gazer introduces a training-free framework that uses multimodal large language model feedback to correct semantic errors in real time during autoregressive visual model generation. By integrating reflective diagnosis and semantic correction stages, Gazer improves compositional accuracy and semantic alignment across multiple models without additional training.

arxiv arXiv cs.AI · 15h ago

Multimodal Chain-of-Thought: Capabilities and Limitations

Multimodal Chain-of-Thought reasoning improves performance in mathematical and scientific reasoning but harms visual grounding and object counting in perception tasks. Models exhibit a 'Look Light, Think Heavy' pattern, where visual reflection diminishes while verbal reasoning increases, indicating a persistent bottleneck in visual introspection during multimodal reasoning.

arxiv arXiv cs.AI · 15h ago

SmartSDG Pipeline Enhances Syn-to-Real Object Detection

The paper introduces SmartSDG, an automated pipeline using NVIDIA Isaac Sim and Physically-Based Shading to optimize synthetic-to-real domain adaptation. It shows that indirect lighting and complex backgrounds improve object detection by preserving surface textures and reducing false positives, outperforming conventional direct-light synthetic data.

media r/LocalLLaMA · 16h ago

Baidu's Unlimited-OCR Transcribes Dozens of Pages in One Forward Pass

Baidu has released Unlimited-OCR, a model that transcribes dozens of pages in a single forward pass using Reference Sliding Window Attention (R-SWA). It builds on DeepSeek-OCR, inheriting its encoder, image compression, and MoE architecture, with only 500M active parameters per token. The model achieves 93.92% accuracy on OmniDocBench v1.6, outperforming DeepSeek-OCR's 87.01% on v1.5, though vendor-reported results warrant independent validation.

arxiv arXiv cs.LG · 17h ago

DataClaw0: Agentic Tailoring of Multimodal Data from Raw Streams

DataClaw0 introduces an agentic paradigm for actively refining raw multimodal data to align with user and downstream intents. It uses a two-stage pipeline grounded in factual anchors to generate a large-scale dataset across five domains and combines supervised fine-tuning with GRPO to achieve strong alignment with complex refinement tasks. Evaluated on video generation, VQA, and GUI navigation, DataClaw0 produces high-information-density tailored data, enabling efficient model adaptation with minimal training data.

arxiv arXiv cs.LG · 18h ago

Neural Action Codec for Vision-Language-Action Models

NAC, a neural audio codec-inspired architecture, compresses robot action trajectories as multi-channel 1D signals using multi-scale residual vector quantization. By replacing mel-spectrogram losses with time-domain and non-mel spectral reconstruction, NAC achieves high-fidelity action encoding with minimal architectural changes, outperforming existing tokenizers in reconstruction error and success rates on real-world manipulation tasks.
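
As a rough illustration of the residual vector quantization idea named above, the sketch below quantizes a single action vector in two stages, with each stage encoding what the previous stage left over. It is plain single-scale RVQ with hand-written codebooks, not NAC's multi-scale, learned variant; the codebooks, the example vector, and the function names are all assumptions made for this example.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    def dist(entry):
        return sum((e - v) ** 2 for e, v in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes the previous residual."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes, codebooks):
    """Reconstruct by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical codebooks: a coarse stage followed by a fine stage.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],
]
action = [1.3, 0.9]  # assumed: one time step of a 2-channel action signal
codes = rvq_encode(action, codebooks)
recon = rvq_decode(codes, codebooks)
print(codes, recon)
```

Each additional stage shrinks the reconstruction error at the cost of one more discrete token per vector, which is the trade-off an action tokenizer of this kind has to balance.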

arxiv arXiv cs.LG · 18h ago

Atomistic Language Models Understand and Generate Materials

Atomistic Language Models (ALMs) unify language and atomistic structures, enabling natural language-driven crystal generation and optimization. ALMs use a continuous bridge to map language embeddings into atomistic diffusion steering space and employ Text-to-Crystal Feynman-Kac for stoichiometric accuracy. The ALM Bench benchmark evaluates text-conditioned material generation and optimization, with code and weights to be released soon.

arxiv arXiv cs.LG · 19h ago

ASCII Art Enables Text-Only LLMs to Control VLA Systems

A text-only large language model can be adapted into a Vision-Language-Action controller by using ASCII-rendered visual observations. This approach allows LLMs to interpret visual states through text, enabling them to follow natural-language instructions and generate executable actions in both simulation and on physical manipulators.
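
One way to picture an ASCII-rendered observation is sketched below: a grayscale frame is averaged over blocks and each block is mapped to a character by brightness. The summary does not describe the paper's actual rendering scheme, so the character ramp, the block averaging, the prompt wording, and the name `to_ascii` are all assumptions for illustration.

```python
RAMP = " .:-=+*#%@"  # dark to bright; an arbitrary ramp chosen for this sketch


def to_ascii(image, cols, rows):
    """Downsample a grayscale image (rows of 0-255 ints) to an ASCII grid.

    Assumes cols <= image width and rows <= image height.
    """
    height, width = len(image), len(image[0])
    lines = []
    for r in range(rows):
        y0, y1 = r * height // rows, (r + 1) * height // rows
        chars = []
        for c in range(cols):
            x0, x1 = c * width // cols, (c + 1) * width // cols
            block = [image[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            mean = sum(block) / len(block)
            chars.append(RAMP[int(mean * (len(RAMP) - 1) / 255)])
        lines.append("".join(chars))
    return "\n".join(lines)


# Hypothetical 4x4 frame: a bright object in the top-left corner.
frame = [
    [255, 255, 0, 0],
    [255, 255, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 128, 128],
]
observation = to_ascii(frame, cols=2, rows=2)
prompt = "Observation:\n" + observation + "\nInstruction: pick up the bright block."
print(prompt)
```

The resulting string can be placed directly in a text prompt, which is what lets a model without an image encoder receive a coarse view of the scene.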

arxiv arXiv cs.LG · 19h ago

Decoupling Declarative and Procedural Knowledge in Vision-Language-Action Models

w²VLA introduces a modular approach that decouples declarative and procedural knowledge in Vision-Language-Action models. By restructuring information flow, it enables robust behavior cloning and unprecedented zero-shot skill transfer across unseen, dissimilar objects.

media r/LocalLLaMA · 20h ago

Qwen releases 35B-parameter MoE for agent environment simulation

Qwen has launched Qwen-AgentWorld-35B-A3B, a 35B-parameter MoE model with only about 3B active parameters per token. It is trained to simulate responses from MCP, terminal, software engineering, Android, web, and OS GUI environments by predicting next observations after agent actions, enabling efficient agent training and environment simulation without real tool execution.

arxiv arXiv cs.CL · 21h ago

ParaPairAudioBench: Benchmark for Paralinguistic Speech Evaluation

ParaPairAudioBench introduces a pairwise benchmark of 5,175 audio pairs across five paralinguistic dimensions. It reveals that current LALM judges lag human judgments by 32% on average and fail to calibrate, especially in tie cases where abstention is correct.

arxiv arXiv cs.CL · 23h ago

MMed-Bench-IR: A Multilingual Medical Retrieval Benchmark

MMed-Bench-IR introduces a heterogeneous benchmark for multilingual medical information retrieval across six languages. It evaluates cross-lingual alignment, concept discrimination, and evidence retrieval through three distinct tasks with no overlapping concepts or queries. Evaluation shows significant cross-lingual performance drops, with English biomedical encoders falling from 0.818 to 0.056 nDCG@10 when transitioning to Japanese, highlighting limitations undetected by English-only benchmarks.
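
For context on the nDCG@10 figures quoted above, here is a minimal sketch of the metric. It assumes linear gain and assumes the input list holds the relevance labels of every judged document for a query, in the order the retriever ranked them; the benchmark's exact variant is not stated in the summary, and the example lists are hypothetical.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results, in ranked order."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical rankings: 1 marks a relevant document, 0 an irrelevant one.
strong_ranking = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
weak_ranking = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
print(ndcg_at_k(strong_ranking), ndcg_at_k(weak_ranking))
```

A score near 0 therefore means that almost none of the relevant documents made it into the top ten results for a query.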

arxiv arXiv cs.CL · 23h ago

AVOC: Retrieval-Inspired Token Compression for Long-Form Audio-Video Understanding

AVOC enhances long-form audio-video understanding in omni-modal LLMs by introducing a learnable token compression module. It reframes token selection as a top-K retrieval problem, using relevance, importance, and diversity criteria to select compact, informative tokens, achieving state-of-the-art results on OmniVideoBench and LVOmniBench, and maintaining strong performance on one-hour audio-video needle-in-a-haystack tasks.
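
The three selection criteria named above can be pictured with a simple greedy heuristic. AVOC's module is learnable, so the sketch below is a hand-written stand-in rather than the paper's method: relevance is cosine similarity to a query vector, importance is a given per-token score, and diversity is a penalty on similarity to tokens already chosen. All names, weights, and vectors are assumptions.

```python
def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) * sum(y * y for y in b)) ** 0.5
    return dot / norm if norm else 0.0


def select_tokens(tokens, query, importance, k, diversity_weight=0.5):
    """Greedily pick k token indices, trading off relevance to the query,
    per-token importance, and redundancy with tokens already picked."""
    selected = []
    candidates = list(range(len(tokens)))
    while candidates and len(selected) < k:
        def score(i):
            redundancy = max(
                (cosine(tokens[i], tokens[j]) for j in selected), default=0.0
            )
            relevance = cosine(tokens[i], query)
            return relevance + importance[i] - diversity_weight * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return sorted(selected)  # restore temporal order for the downstream model


# Hypothetical token embeddings: tokens 0 and 1 are near-duplicates.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
query = [1.0, 0.5]
importance = [0.0, 0.0, 0.0, 0.0]
kept = select_tokens(tokens, query, importance, k=2)
print(kept)
```

With the diversity penalty switched off (`diversity_weight=0.0`), the same call keeps both near-duplicate tokens, which shows why a relevance-only top-K can waste the token budget.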

arxiv arXiv cs.AI · 1d ago

MedLayXPlain: Benchmarking Expert-Lay Gap in Medical Vision-Language Models

MedLayXPlain introduces the first large-scale benchmark for medical lay language generation, featuring 122,789 region-grounded samples across eight imaging modalities. It evaluates medical vision-language models on expert-lay alignment using a hierarchical ontology system and a lightweight evaluator, revealing a systematic gap: expert-level performance in captioning coexists with significant degradation in lay language, while general-purpose models lack clinical precision.

arxiv arXiv cs.AI · 1d ago

Extraction and Analysis of Multimodal Concepts in Vision Language Models

A new framework using Sparse Autoencoders extracts and analyzes visual, textual, and multimodal concepts from Vision Language Models. Experiments on LLaVA-NeXT show up to 45% improvement in visual concept quality and systematic identification of multimodal concepts, offering a structured approach to understanding VLM internal representations.

arxiv arXiv cs.AI · 1d ago

FleetAgent: Efficient Teleoperation for Autonomous Fleets

FleetAgent is a cloud-hosted multimodal large language model that processes compact vectorized vehicle-to-network messages to enable efficient, explainable teleoperation. Compared to raw images or text, it cuts uplink payload by up to 625 times and KV-cache memory by 625 times. On the VecEval dataset, it outperforms Qwen2.5-VL-7B on both Lingo-Judge score and intervention failure rate.

arxiv arXiv cs.AI · 1d ago

FastGAN and Transformer Models Improve Aphid Detection in Faba Beans

A study uses FastGAN to generate 10,000 synthetic hyperspectral images of faba bean leaves, preserving real spectral and structural features. Transformer-based models, particularly Vision Transformer, achieve the highest accuracy and F1-scores in classifying healthy versus aphid-infested leaves, outperforming classical CNNs and detecting infestations with fewer false negatives.