Multimodal
arXiv cs.AI · 6d ago

Hidden Evolution of Disguised Visual Context in VLMs

Visual tokens enter large language models as raw, unstructured signals. How they are transformed and integrated depends on the architecture: they are either supplied as in-context prompts or injected into intermediate layers, and the two paradigms lead to distinct evolution paths in visual representation and frequency characteristics. We find that attention alone is insufficient; across both integration paradigms, performance is driven by the quality of the visual representations at each layer.

arXiv cs.AI · 6d ago

IHUBERT: Persian Pretrained Model with Semantic Deduplication

IHUBERT is a monolingual Persian pretrained language model trained on a 45 GB curated subset of the Sepahr-Danesh collection. It uses vector-based semantic deduplication and a domain-balanced pretraining pipeline to improve corpus quality and reduce redundancy. The model achieves top performance in extractive question answering and strong results in NER and topic classification, though relation extraction remains a challenge.
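
As a rough illustration of what vector-based semantic deduplication can look like, the sketch below keeps a document only if its embedding is not too close (by cosine similarity) to one already kept. The greedy strategy, the function names, and the 0.95 threshold are assumptions made for this example; the summary does not say how IHUBERT's pipeline embeds or compares documents.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def semantic_dedup(embeddings, threshold=0.95):
    """Greedy near-duplicate filter (hypothetical helper, not IHUBERT's code).

    Returns the indices of documents to keep: a document is kept only if
    its similarity to every previously kept document is below `threshold`.
    """
    kept = []
    for i, vec in enumerate(embeddings):
        if all(cosine(vec, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Toy embeddings: the second vector nearly duplicates the first.
docs = [[1.0, 0.0], [0.999, 0.01], [0.0, 1.0]]
kept_indices = semantic_dedup(docs)
```

This pairwise scan is quadratic in the number of documents, so a corpus-scale pipeline would more plausibly use approximate nearest-neighbour search or clustering; the sketch only shows the filtering criterion.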

arXiv cs.AI · 6d ago

Dual-Agent Framework for Cross-Model Verified Translation

A dual-agent framework converts natural-language experiment protocols into executable commands for robotic lab platforms. It uses a Parser Agent and a rule-based mapping engine to translate protocols, with a heterogeneous LLM Validation Agent ensuring accuracy and triggering self-correction. The framework successfully enables end-to-end autonomous execution of microplate-based experiments like the Bradford assay.

arXiv cs.AI · 6d ago

Frequency-Aware Flow Matching for Robotic Action Generation

Frequency-Aware Flow Matching (FAFM) enables continuous and temporally consistent robotic action generation by transforming discrete action sequences into the frequency domain using discrete cosine transform. It regularizes first-order temporal derivatives with a Sobolev-type constraint to ensure smooth actions, improving success rates, motion smoothness, and robustness across synthetic and real-world tasks without adding network parameters.
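
To make the two ingredients concrete, here is a minimal sketch of an orthonormal DCT-II applied to a one-dimensional action sequence, together with a frequency-weighted penalty that stands in for a first-derivative (Sobolev-type) smoothness term. The weighting `(pi * k / N) ** 2` and all function names are assumptions chosen for illustration; the summary does not give FAFM's actual loss.

```python
import math


def dct(x):
    """Orthonormal DCT-II of a list of floats."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        total = sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        out.append(scale * total)
    return out


def idct(coeffs):
    """Inverse of `dct` (orthonormal DCT-III)."""
    n = len(coeffs)
    out = []
    for t in range(n):
        total = 0.0
        for k in range(n):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            total += scale * coeffs[k] * math.cos(math.pi * (t + 0.5) * k / n)
        out.append(total)
    return out


def smoothness_penalty(x):
    """Illustrative Sobolev-type term: weights each DCT coefficient by its
    squared frequency, so high-frequency (jerky) content is penalised more.
    This is an assumed surrogate, not the regulariser from the paper."""
    n = len(x)
    return sum((math.pi * k / n) ** 2 * c * c for k, c in enumerate(dct(x)))


smooth = [0.0, 0.1, 0.2, 0.3]   # steady ramp of one action dimension
jagged = [0.0, 0.3, 0.0, 0.3]   # oscillating commands
penalty_smooth = smoothness_penalty(smooth)
penalty_jagged = smoothness_penalty(jagged)
```

In this toy case the oscillating sequence receives a much larger penalty than the ramp, and a constant sequence receives essentially none, which is the behaviour a derivative-based regulariser is meant to have. Because the penalty is computed from the signal itself, it adds no learnable parameters.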

arXiv cs.AI · 6d ago

BIM-Edit: Benchmarking LLMs for IFC-Based BIM Editing

BIM-Edit introduces a benchmark to evaluate large language models on natural-language editing of Building Information Models in IFC format. It includes 324 editing tasks across 11 real and 36 synthetic building models, assessing geometric accuracy, semantic validity, and topological consistency. The best model achieves an average score of only 49.5%, and no model solves more than 3.4% of tasks, highlighting a significant gap in LLM capabilities for engineering design workflows.

arXiv cs.AI · 6d ago

RS-Neg Benchmark and NeFo Method for Negation Understanding in Remote Sensing MLLMs

RS-Neg is the first benchmark to evaluate negation comprehension in remote sensing tasks across region-level and scene-level scenarios. It reveals that advanced remote sensing MLLMs struggle with negation, showing hallucinations and performance drops. NeFo, a test-time learning method, improves negation understanding using only 5% unlabeled test data and generalizes well to new tasks.

arXiv cs.LG · 6d ago

Alzheimer's Diagnosis via Multimodal 3D MRI and PET Fusion

A new study combines 3D MRI and PET data using advanced fusion strategies, including GMU and gated self-attention, along with a sparsely gated mixture-of-experts (MoE) classifier. GMU achieves 80.46% accuracy on normal controls (NC) vs. mild cognitive impairment (MCI) and 95.47% on NC vs. Alzheimer's disease (AD), while gated self-attention reaches 82.08% on MCI vs. AD. Ablations confirm that the MoE significantly improves performance, highlighting the importance of input-adaptive multimodal modeling for accurate Alzheimer's diagnosis.
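
For readers unfamiliar with gated fusion, the sketch below shows one common formulation, assuming GMU here refers to the Gated Multimodal Unit: each modality is projected through a tanh layer, and a sigmoid gate computed from both inputs decides, per feature, how much of each projection to keep. The weight matrices, feature sizes, and function names are hypothetical, and the study's actual architecture may differ.

```python
import math


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def gmu_fuse(x_mri, x_pet, w_mri, w_pet, w_gate):
    """Gated fusion of two feature vectors (illustrative, bias-free).

    h_mri = tanh(W_mri x_mri), h_pet = tanh(W_pet x_pet)
    z     = sigmoid(W_gate [x_mri; x_pet])
    out   = z * h_mri + (1 - z) * h_pet
    """
    h_mri = [math.tanh(a) for a in matvec(w_mri, x_mri)]
    h_pet = [math.tanh(a) for a in matvec(w_pet, x_pet)]
    gate = [sigmoid(a) for a in matvec(w_gate, x_mri + x_pet)]
    return [z * hm + (1.0 - z) * hp for z, hm, hp in zip(gate, h_mri, h_pet)]


# Toy 2-dimensional features per modality; weights are made up.
identity = [[1.0, 0.0], [0.0, 1.0]]
zero_gate = [[0.0] * 4, [0.0] * 4]      # gate = 0.5 everywhere
fused = gmu_fuse([1.0, 0.0], [0.0, 1.0], identity, identity, zero_gate)
```

With a zero gate matrix the gate is 0.5 for every feature, so the output is the plain average of the two projections; a learned gate instead shifts the balance per input, which is the input-adaptive behaviour the summary refers to.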

arXiv cs.LG · 6d ago

MELT and SALT: Multimodal Contrastive Learning for Earth Embeddings

MELT and SALT are multimodal contrastive learning models that use unpaired geospatial data to improve location embeddings. Both achieve performance equal to the best two-modality baseline across four tasks, but adding more modalities does not consistently boost results, indicating the location encoder's design is the primary performance limit. MELT offers more stable training and is better suited for future model scaling.

arXiv cs.LG · 6d ago

Computational Methods for Cell-Free DNA in Multi-Cancer Early Detection

This review outlines computational methods from 2022 to 2025 for detecting multiple cancers from blood-based cell-free DNA. It evaluates fragmentomics and epigenetic analysis, covering statistical, machine learning, and deep learning approaches, with a focus on biological interpretability, validation, and clinical readiness. Multimodal ensemble methods show the highest promise for clinical use, but standardized evaluation protocols are needed for reliable comparison and future progress.