Multimodal — korshunov.ai

See-and-Reach: Vision-Language Navigation for UAVs in Field of View

UAV-VLN-FOV isolates the see-and-reach stage for precise evaluation of UAV navigation. 3DG-VLN enhances visual grounding and spatial alignment using dynamic 3D direction cues, achieving a 13.82% improvement in success rate over baselines, and is validated in real-world trials.

arXiv cs.AI · 6d ago

Hidden Evolution of Disguised Visual Context in VLMs

Visual tokens enter large language models as raw, unstructured signals. How they are transformed and integrated internally depends on the architecture, whether they are supplied as in-context prompts or injected into intermediate layers, and each paradigm produces distinct evolution paths in visual representation and frequency characteristics. We find that attention alone is insufficient; performance is driven by the quality of visual representations at each layer across different integration paradigms.

arXiv cs.AI · 6d ago

IHUBERT: Persian Pretrained Model with Semantic Deduplication

IHUBERT is a monolingual Persian pretrained language model trained on a 45 GB curated subset of the Sepahr-Danesh collection. It uses vector-based semantic deduplication and a domain-balanced pretraining pipeline to improve corpus quality and reduce redundancy, achieving top performance in extractive question answering and strong results in NER and topic classification, though relation extraction remains a challenge.

arXiv cs.AI · 6d ago
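
As a rough illustration of what vector-based semantic deduplication can look like, the sketch below greedily keeps a document only if its embedding is not too similar to any document already kept. This is not the IHUBERT pipeline: the function names, the toy 2-D embeddings, and the 0.95 cosine threshold are all assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def semantic_dedup(embeddings, threshold=0.95):
    """Return indices of documents to keep.

    A document is dropped when its embedding has cosine similarity
    >= threshold with any document that was kept earlier.
    The threshold value is a hypothetical choice, not taken from the paper.
    """
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept

# Toy embeddings: the second vector is a near-duplicate of the first.
toy_embeddings = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0]]
kept_indices = semantic_dedup(toy_embeddings)
print(kept_indices)
```

The greedy pass is quadratic in the number of documents; at corpus scale an approximate nearest-neighbour index would normally replace the inner loop.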

Dual-Agent Framework for Cross-Model Verified Translation

A dual-agent framework converts natural-language experiment protocols into executable commands for robotic lab platforms. It uses a Parser Agent and a rule-based mapping engine to translate protocols, with a heterogeneous LLM Validation Agent ensuring accuracy and triggering self-correction. The framework successfully enables end-to-end autonomous execution of microplate-based experiments like the Bradford assay.

arXiv cs.AI · 6d ago

Frequency-Aware Flow Matching for Robotic Action Generation

Frequency-Aware Flow Matching (FAFM) enables continuous and temporally consistent robotic action generation by transforming discrete action sequences into the frequency domain using discrete cosine transform. It regularizes first-order temporal derivatives with a Sobolev-type constraint to ensure smooth actions, improving success rates, motion smoothness, and robustness across synthetic and real-world tasks without adding network parameters.

arXiv cs.AI · 6d ago
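
To make the frequency-domain idea concrete, here is a minimal sketch, not the FAFM implementation. It applies an orthonormal DCT-II to a single action dimension and shows that a first-order smoothness penalty (the sum of squared differences between consecutive steps) can be evaluated directly on the DCT coefficients, where it weights high frequencies more heavily. The trajectory values, the function names, and the choice of this particular penalty as a stand-in for the Sobolev-type constraint are assumptions.

```python
import math

def dct_basis(n):
    """Rows are the orthonormal DCT-II basis vectors of length n."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n)])
    return rows

def dct(signal):
    """Forward orthonormal DCT-II of a 1-D sequence."""
    return [sum(b * x for b, x in zip(row, signal)) for row in dct_basis(len(signal))]

def idct(coeffs):
    """Inverse of dct(): rebuild the sequence from its coefficients."""
    n = len(coeffs)
    basis = dct_basis(n)
    return [sum(basis[k][t] * coeffs[k] for k in range(n)) for t in range(n)]

def roughness_time(signal):
    """Sum of squared first-order differences between consecutive steps."""
    return sum((b - a) ** 2 for a, b in zip(signal, signal[1:]))

def roughness_freq(coeffs):
    """The same quantity from DCT coefficients; the weight grows with frequency k."""
    n = len(coeffs)
    return sum((2.0 - 2.0 * math.cos(math.pi * k / n)) * c * c
               for k, c in enumerate(coeffs))

# Hypothetical 8-step trajectory for one action dimension (e.g. one joint).
actions = [0.0, 0.3, 0.1, 0.5, 0.4, 0.8, 0.6, 1.0]
coeffs = dct(actions)

# Zeroing the upper half of the spectrum yields a smoother trajectory.
low_pass = coeffs[:4] + [0.0] * 4
smoothed = idct(low_pass)
print(roughness_time(actions), roughness_time(smoothed))
```

Because the penalty is a weighted sum of squared coefficients, it can be computed or regularized in the frequency domain without adding any learned parameters, which is in the spirit of the summary above.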

BIM-Edit: Benchmarking LLMs for IFC-Based BIM Editing

BIM-Edit introduces a benchmark to evaluate large language models on natural-language editing of Building Information Models in IFC format. It includes 324 editing tasks across 11 real and 36 synthetic building models, assessing geometric accuracy, semantic validity, and topological consistency. The best model achieves an average score of only 49.5%, and no model solves more than 3.4% of tasks, highlighting a significant gap in LLM capabilities for engineering design workflows.

arXiv cs.AI · 6d ago

MedRLM: Recursive Multimodal Health Intelligence Framework

MedRLM enables long-context clinical reasoning by recursively inspecting patient data across text, images, sensors, and guidelines. It integrates specialized agents and a Clinical Evidence Graph Memory to connect observations with evidence and referral criteria, supporting sensor-triggered reasoning and uncertainty-gated clinician review.

arXiv cs.AI · 6d ago

RS-Neg Benchmark and NeFo Method for Negation Understanding in Remote Sensing MLLMs

RS-Neg is the first benchmark to evaluate negation comprehension in remote sensing tasks across region-level and scene-level scenarios. It reveals that advanced remote sensing MLLMs struggle with negation, showing hallucinations and performance drops. NeFo, a test-time learning method, improves negation understanding using only 5% unlabeled test data and generalizes well to new tasks.

arXiv cs.AI · 6d ago

HilDA: Hierarchical Distillation with Diffusion for Self-Supervised LiDAR Pretraining

HilDA introduces a self-supervised pretraining framework for LiDAR backbones that uses hierarchical distillation and temporal occupancy diffusion to improve semantic and geometric understanding. It achieves state-of-the-art results on cross-modal distillation benchmarks and outperforms prior methods in 3D object detection, scene flow, and semantic occupancy prediction.

arXiv cs.AI · 6d ago

FlowMaps Models Long-Term Multimodal Object Dynamics

FlowMaps is a latent flow matching model that predicts future object locations in 3D environments by learning spatio-temporal patterns from human interactions. It outperforms state-of-the-art methods in dynamic object navigation across more than 600 episodes in both simulated and real-world settings.

arXiv cs.AI · 6d ago

SPOT-E: Test-Time Entropy Shaping with Visual Spotlights for Frozen VLMs

SPOT-E introduces a test-time method that uses visual spotlights to enhance evidence grounding in frozen vision-language models. It employs low-entropy anchors and an entropy-shaping objective to reduce answer uncertainty while preserving high-confidence tokens, improving robustness under visual corruptions across benchmarks and VLM families.

arXiv cs.AI · 6d ago

Lagrange: Open-Vocabulary Sparse Framework for End-to-End Driving

Lagrange introduces an open-vocabulary, energy-based sparse framework for generalized end-to-end driving. It uses Vision-Language Models to generate class-agnostic object proposals and encodes them into continuous semantic tokens, enabling robust generalization to anomalous scenarios while adhering to vehicle kinematics through Lagrangian action minimization.

arXiv cs.AI · 6d ago

ELVA: A Ranking-Driven Framework for Multimodal Retrieval

ELVA introduces a rule-based reinforcement learning framework to address grain blindness in multimodal retrieval. By using verifiable rewards and differentiating negative samples based on similarity, ELVA improves ranking precision and achieves a 13.1% gain on MRBench, a benchmark for multi-grain query scenarios.

arXiv cs.LG · 6d ago

Alzheimer's Diagnosis via Multimodal 3D MRI and PET Fusion

A new study combines 3D MRI and PET data using fusion strategies including a Gated Multimodal Unit (GMU) and gated self-attention, along with a sparsely gated mixture-of-experts (MoE) classifier. GMU achieves 80.46% accuracy on normal controls (NC) vs. mild cognitive impairment (MCI) and 95.47% on NC vs. Alzheimer's disease (AD), while gated self-attention reaches 82.08% on MCI vs. AD. Ablations confirm that the MoE significantly improves performance, highlighting the importance of input-adaptive multimodal modeling for accurate Alzheimer's diagnosis.

arXiv cs.LG · 6d ago
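
For readers unfamiliar with GMU-style fusion, the sketch below shows the general gating idea on two feature vectors: each modality is projected through a tanh layer, and a sigmoid gate computed from both inputs decides, per output unit, how much of each projection to keep. It is a toy illustration rather than the study's model; the weights, dimensions, and names (`gmu_fuse`, `W_mri`, `W_pet`, `W_gate`) are hypothetical.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def gmu_fuse(x_mri, x_pet, W_mri, W_pet, W_gate):
    """Gated fusion of two modality feature vectors.

    h_mri = tanh(W_mri @ x_mri), h_pet = tanh(W_pet @ x_pet)
    z     = sigmoid(W_gate @ concat(x_mri, x_pet))
    out   = z * h_mri + (1 - z) * h_pet   (elementwise)
    """
    h_mri = [math.tanh(a) for a in matvec(W_mri, x_mri)]
    h_pet = [math.tanh(a) for a in matvec(W_pet, x_pet)]
    z = [sigmoid(a) for a in matvec(W_gate, x_mri + x_pet)]
    return [g * m + (1.0 - g) * p for g, m, p in zip(z, h_mri, h_pet)]

# Toy 2-D features and hand-picked weights (hypothetical values).
identity = [[1.0, 0.0], [0.0, 1.0]]
zero_gate = [[0.0] * 4, [0.0] * 4]   # gate logits of 0 give z = 0.5
fused = gmu_fuse([1.0, 0.0], [0.0, 1.0], identity, identity, zero_gate)
print(fused)
```

Because the gate depends on the inputs, the mix between modalities can differ from one sample to the next, which is one sense in which such fusion is input-adaptive.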

PaAno+: Lightweight Time Series Anomaly Detection with Multiscale and Cross-Variable Attention

PaAno+ introduces a lightweight model that uses multiscale convolution and cross-variable attention to improve time series anomaly detection. It achieves state-of-the-art accuracy on both univariate and multivariate tasks, with superior performance in VUS-PR and other metrics, while maintaining efficient computation for real-time deployment on resource-limited devices.

arXiv cs.LG · 6d ago

Pose6DAug: Physically Plausible Multi-view Object Swapping

Pose6DAug enables robot data augmentation by swapping objects in successful episodes while preserving physically valid 6D pose trajectories. It operates in 3D using a mesh anchored by temporally coherent poses, ensuring multi-view consistency and physical plausibility. Fine-tuning a VLA policy on this augmented data improves novel object success rates by 16.5% over state-of-the-art baselines.

arXiv cs.LG · 6d ago

MELT and SALT: Multimodal Contrastive Learning for Earth Embeddings

MELT and SALT are multimodal contrastive learning models that use unpaired geospatial data to improve location embeddings. Both achieve performance equal to the best two-modality baseline across four tasks, but adding more modalities does not consistently boost results, indicating the location encoder's design is the primary performance limit. MELT offers more stable training and is better suited for future model scaling.

arXiv cs.LG · 6d ago

Machine Learning Predicts Gestational Age from Fetal MRI

A machine learning pipeline using multimodal fetal MRI data predicts gestational age at birth with an R² of 0.13 and a mean absolute error of 2.74 weeks. It achieves 0.77 accuracy, 0.59 sensitivity, and 0.82 specificity, with cervical length and placental T2* statistics as key features. This work presents a proof of concept for predicting preterm birth using MRI and machine learning.

arXiv cs.LG · 6d ago
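
For reference, the three classification metrics quoted above are simple ratios of confusion-matrix counts. The sketch below computes them from binary labels. The labels are made-up toy data, not the study's, and treating 1 as "preterm" is an assumption for the example.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity for binary labels (1 = positive).

    Assumes at least one positive and one negative in y_true;
    otherwise a ratio below would divide by zero.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),   # share of positives that were caught
        "specificity": tn / (tn + fp),   # share of negatives correctly cleared
    }

# Toy labels: 3 positive cases and 5 negative cases (hypothetical).
toy_true = [1, 1, 1, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 0, 0, 0, 0, 1]
metrics = binary_metrics(toy_true, toy_pred)
print(metrics)
```

When positives are the minority class, accuracy is dominated by the negatives, so it can sit well above sensitivity, which is the pattern in the figures reported above.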

Computational Methods for Cell-Free DNA in Multi-Cancer Early Detection

This review outlines computational methods from 2022 to 2025 for detecting multiple cancers from blood-based cell-free DNA. It evaluates fragmentomics and epigenetic analysis, covering statistical, machine learning, and deep learning approaches, with a focus on biological interpretability, validation, and clinical readiness. Multimodal ensemble methods show the highest promise for clinical use, but standardized evaluation protocols are needed for reliable comparison and future progress.