Multimodal — korshunov.ai

Topic · Multimodal

arxiv arXiv cs.AI · 9d ago

OneCanvas: 3D Scene Understanding via Panoramic Reprojection

OneCanvas enables 3D scene understanding in Vision-Language Models by aggregating patch features onto a panoramic canvas using 3D world coordinates. It achieves state-of-the-art results on SQA3D and VSI-Bench, with strong generalization on SPBench, using significantly less training compute than prior methods.
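
As a rough illustration of placing patch features on one panoramic canvas from 3D world coordinates, the sketch below maps each patch's 3D point to a cell of an equirectangular grid and averages the features that land in the same cell. This is not the OneCanvas implementation: the equirectangular layout, the z-up frame, the choice of canvas origin, mean pooling, and the names `world_to_canvas` and `aggregate_patches` are all assumptions made for the example.

```python
import math


def world_to_canvas(point, origin, width, height):
    """Map a 3D world point to a (col, row) cell on an equirectangular canvas.

    Assumption: z-up world frame, canvas centred on `origin`.
    """
    dx, dy, dz = (p - o for p, o in zip(point, origin))
    dist = math.sqrt(dx * dx + dy * dy + dz * dz)
    if dist == 0:
        raise ValueError("point coincides with the canvas origin")
    azimuth = math.atan2(dy, dx)  # in [-pi, pi]
    # Clamp to guard against rounding pushing the ratio outside [-1, 1].
    elevation = math.asin(max(-1.0, min(1.0, dz / dist)))  # in [-pi/2, pi/2]
    u = (azimuth + math.pi) / (2 * math.pi)  # 0..1, left to right
    v = (math.pi / 2 - elevation) / math.pi  # 0..1, top to bottom
    col = min(int(u * width), width - 1)
    row = min(int(v * height), height - 1)
    return col, row


def aggregate_patches(patches, origin, width, height):
    """Average the feature vectors of all patches that fall in the same cell.

    `patches` is a list of (point, feature) pairs (hypothetical input format).
    """
    sums, counts = {}, {}
    for point, feature in patches:
        cell = world_to_canvas(point, origin, width, height)
        if cell not in sums:
            sums[cell] = [0.0] * len(feature)
            counts[cell] = 0
        sums[cell] = [s + f for s, f in zip(sums[cell], feature)]
        counts[cell] += 1
    return {cell: [s / counts[cell] for s in total] for cell, total in sums.items()}
```

Two patches seen from different frames but lying in the same direction from the origin end up in the same cell, which is the property that lets a single canvas stand in for many views.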

arxiv arXiv cs.CL · 9d ago

OmniAgent: Native Active Perception for Omni-Modal Understanding

OmniAgent introduces a POMDP-based iterative Observation-Thought-Action cycle for video understanding, enabling on-demand action execution to selectively distill audio-visual cues into persistent textual memory. It achieves state-of-the-art performance on ten benchmarks, with a 7B agent outperforming a 10× larger Qwen2.5-VL-72B model on LVBench (50.5% vs. 47.3%).

arxiv arXiv cs.LG · 9d ago

ViGOS: Decoupling Perception and Reasoning in Multimodal On-Policy Self-Distillation

ViGOS introduces a visually grounded on-policy self-distillation framework for multimodal large language models. It decouples perception and reasoning by using an image-only teacher for visual descriptions and a reasoning teacher for final outputs, reducing reliance on text-only references. This approach improves image-grounded performance across multiple vision-language benchmarks.

arxiv arXiv cs.AI · 9d ago

RTSGameBench: An RTS Benchmark for Strategic Reasoning

RTSGameBench addresses limitations in existing RTS benchmarks by offering diverse gameplay, targeted competency diagnosis, and self-evolving scenario generation. It evaluates vision-language models in strategic reasoning under uncertainty, revealing that state-of-the-art models struggle with multiagent coordination and large-scale tasks.

arxiv arXiv cs.AI · 9d ago

ThinkDeception: Interpretable Multimodal Deception Detection Framework

ThinkDeception introduces a progressive reinforcement learning framework that enables interpretable multimodal deception detection. It leverages a step-by-step annotated Chain of Thought dataset and proposes Visual-Audio Consistency Group Relative Policy Optimization with a dynamic curriculum, enhancing reasoning quality and outperforming existing methods on mainstream benchmarks.

arxiv arXiv cs.LG · 10d ago

Vision-language models don't always need images for chest X-ray accuracy

A causal audit shows that many vision-language models achieve high chest radiograph accuracy without using images. Text-only models match multimodal models in performance and outperform them in grounding, with accuracy and confidence flags only appearing when image use occurs. These findings suggest that accuracy alone is insufficient to validate clinical deployment, and grounding must be assessed.

arxiv arXiv cs.AI · 10d ago

WEQA: Wearable Health Question Answering with Query-Adaptive Agentic Reasoning

WEQA introduces a query-adaptive agent framework that combines language models with specialized wearable data analysis tools. It outperforms LLM and agentic baselines by 24% in accuracy and demonstrates improved usefulness and clinical soundness in expert and user evaluations.

arxiv arXiv cs.AI · 10d ago

LEADS: Agentic Discovery of Hybrid Models for Cardiac Electrophysiology

LEADS proposes a framework that uses an LLM agent to discover hybrid cardiac electrophysiology models through an iterative reasoning-and-action loop. It formulates domain knowledge as a structured action space, enabling physically grounded, interpretable, and numerically stable model designs, outperforming both human-designed and other LLM-based approaches on synthetic and real cardiac data.

arxiv arXiv cs.CL · 10d ago

Visuals Lie, Consistency Speaks: Disentangling Spatial Attention from Reliability in Vision-Language Models

A study challenges the assumption that visual attention signals reliability in vision-language models. It finds near-zero correlation between spatial attention and accuracy, showing instead that self-consistency across reasoning paths is a stronger predictor of truth. Reliability is better explained by generation dynamics and internal state distributions, not visual attention patterns.
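
One common way to turn "self-consistency across reasoning paths" into a number is the share of sampled answers that agree with the majority answer. The sketch below shows that measure only; whether the study uses this exact score is an assumption, and the function name `self_consistency` is hypothetical.

```python
from collections import Counter


def self_consistency(answers):
    """Return the majority answer and the fraction of samples that agree with it.

    `answers` holds the final answers from several independently sampled
    reasoning paths for the same question.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)
```

A score near 1.0 means the sampled paths converge on one answer; a score near `1 / len(answers)` means they scatter, which under the study's finding would be a warning sign regardless of where the model attends in the image.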

arxiv arXiv cs.CL · 10d ago

MambaCount: Efficient Text-guided Object Counting

MambaCount introduces a spatial sparse state space duality block to enable efficient text-guided open-vocabulary object counting. It addresses causal modeling limitations and high entropy in spatial token responses, achieving state-of-the-art results on FSC-147 with a test MAE of 12.23 while maintaining linear complexity.

arxiv arXiv cs.CL · 10d ago

Vision-language models don't always need images for chest X-ray accuracy

A causal audit shows that text-only models match multimodal models in chest radiography accuracy. Across nine systems, a text-only model performs within 5.7 points of the best multimodal model, and a 119-billion-parameter model is indistinguishable from a 7-billion-parameter text-only baseline. Grounding audits, not accuracy, should determine clinical deployment.

arxiv arXiv cs.CL · 11d ago

ContextRL: Context-Aware RL for LLMs

ContextRL introduces an indirect auxiliary objective to improve long-horizon reasoning and multimodal performance in LLMs. It rewards models for selecting the context that supports a query-answer pair, using contrastive context data from coding agent trajectories and image-based visual questions. ContextRL achieves +2.2% and +1.8% gains over standard methods on long-horizon and visual QA benchmarks, with gains attributed to the selection objective, not data augmentation.

arxiv arXiv cs.AI · 11d ago

BinTrack: Open-Source Spatial QA with Binary Trajectory Search

BinTrack is a fully open-source spatial question answering agent that uses binary search over a robot's trajectory to locate answers. It achieves up to 22.8% higher accuracy than other open-source methods and matches closed-source model performance on the most challenging global category of the SpaceLocQA benchmark. The system also offers over 1.5x faster inference and introduces GangnamLoop, a real-world outdoor benchmark collected with a quadruped robot.
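
To show why binary search over a trajectory can cut inference cost, the sketch below finds the first trajectory step at which a yes/no check becomes true, using a logarithmic number of checks. It is not BinTrack's algorithm: it assumes the check is monotone along the trajectory (false before some step, true from then on), and `first_true` and `predicate` are hypothetical names standing in for whatever query the agent actually runs on a frame.

```python
def first_true(num_steps, predicate):
    """Return the first step index where `predicate` holds, or None.

    Assumption: predicate(i) is False for all steps before some index
    and True from that index onward.
    """
    lo, hi = 0, num_steps
    while lo < hi:
        mid = (lo + hi) // 2
        if predicate(mid):
            hi = mid  # the answer is at mid or earlier
        else:
            lo = mid + 1  # the answer is strictly after mid
    return lo if lo < num_steps else None
```

With 16 trajectory steps, this needs at most 5 calls to the check instead of up to 16 for a linear scan, which matters when each call is a model inference on a frame.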

arxiv arXiv cs.LG · 8d ago

MedRLM: Recursive Multimodal Health Intelligence Framework

MedRLM enables long-context clinical reasoning by recursively inspecting patient data across text, images, sensors, and guidelines. It integrates specialized agents and a Clinical Evidence Graph Memory to connect observations with evidence and referral criteria, supporting sensor-triggered reasoning and uncertainty-gated clinician review.

arxiv arXiv cs.LG · 8d ago

De-biased VLM-as-3D-Judge Protocol for Furniture Generation

A de-biased VLM-based judge protocol specializes TRELLIS on furniture generation using lightweight adaptation. The protocol addresses failure modes like image overload and geometry-hiding, with calibration showing 0.83–1.0 win rates and base-vs-base symmetry at 0.5. Among six adaptation methods, conditioner repair under severe degradation achieves parity with the base model, while no method exceeds a 65% win-rate target.

arxiv arXiv cs.CL · 8d ago

NEST: Dataset for Narrative Event Structures in Long Videos

NEST introduces a dataset of 1005 full-length movies, each annotated with 102 multimodal narrative events grounded in visual, dialogue, and audio content. The dataset captures event relationships such as temporal ordering, hierarchy, and long-range dependencies, with benchmark tasks showing low performance in event detection and localization, and higher performance in event relation extraction after fine-tuning.

arxiv arXiv cs.CL · 8d ago

NRITYAM: Benchmark for Cultural Comprehension in Dance

NRITYAM is a multilingual benchmark with 9,260 question-answer pairs across 12 languages, designed to evaluate language models' cultural understanding of global dance traditions. Developed through collaboration with native dance artists and speakers, it offers a comprehensive assessment of AI's ability to grasp traditional performing arts in diverse socio-cultural contexts.

arxiv arXiv cs.CL · 8d ago

MedRLM: Recursive Multimodal Health Intelligence Framework

MedRLM enables long-context clinical reasoning by recursively inspecting patient data across text, images, sensors, and guidelines. It integrates specialized agents and a Clinical Evidence Graph Memory to connect patient observations with evidence, biomarkers, and referral criteria, supporting sensor-triggered reasoning and uncertainty-gated clinician review.

media r/LocalLLaMA · 9d ago

The power of intelligence is better in the hands of the people than in the board rooms of tycoons

The PearlOS project has launched an open-source swarm intelligence platform that uses local models to handle multimodal tasks. It automatically selects and switches between top-performing models based on benchmarks, ensuring users always access the latest and most capable models without relying on closed-source systems or subscriptions.