Multimodal — korshunov.ai


VibrantForests framework maps forest structure at 10-meter resolution

The VibrantForests framework trains satellite-based models on lidar samples to generate annual, wall-to-wall maps of canopy cover, height, biomass, basal area, and quadratic mean diameter at 10-meter resolution across the contiguous U.S. It reduces overestimation in sparse forests and underestimation in dense forests, extending the range of reliable predictions beyond that of traditional passive-sensor models.

arxiv arXiv cs.LG · 6d ago

De-biased VLM-as-3D-Judge Protocol for Furniture Generation

A de-biased VLM-based judge protocol specializes TRELLIS on furniture generation using lightweight adaptation. The protocol addresses failure modes like image overload and geometry-hiding, with calibration showing 0.83–1.0 win rates and base-vs-base symmetry at 0.5. Among six adaptation methods, conditioner repair under severe degradation achieves parity with the base model, while no method exceeds a 65% win-rate target.

arxiv arXiv cs.CL · 6d ago

NEST: Dataset for Narrative Event Structures in Long Videos

NEST introduces a dataset of 1005 full-length movies, each annotated with 102 multimodal narrative events grounded in visual, dialogue, and audio content. The dataset captures event relationships such as temporal ordering, hierarchy, and long-range dependencies, with benchmark tasks showing low performance in event detection and localization, and higher performance in event relation extraction after fine-tuning.

arxiv arXiv cs.CL · 6d ago

NRITYAM: Benchmark for Cultural Comprehension in Dance

NRITYAM is a multilingual benchmark with 9,260 question-answer pairs across 12 languages, designed to evaluate language models' cultural understanding of global dance traditions. Developed through collaboration with native dance artists and speakers, it offers a comprehensive assessment of AI's ability to grasp traditional performing arts in diverse socio-cultural contexts.

arxiv arXiv cs.CL · 6d ago

MedRLM: Recursive Multimodal Health Intelligence Framework

MedRLM enables long-context clinical reasoning by recursively inspecting patient data across text, images, sensors, and guidelines. It integrates specialized agents and a Clinical Evidence Graph Memory to connect patient observations with evidence, biomarkers, and referral criteria, supporting sensor-triggered reasoning and uncertainty-gated clinician review.

arxiv arXiv cs.CL · 6d ago

Algorithm for Pitch Spelling and Key Estimation in Music Transcription

A new algorithm estimates note names, key signatures, and local scales from MIDI-like input by jointly optimizing modal and tonal stages. It has been evaluated on jazz lead sheets, solo transcriptions, traditional tunes, and classical piano scores, with additional distances defined between common jazz scales for musicological research.

arxiv arXiv cs.CL · 6d ago

CzechDocs: Parallel Dataset for Minority Language Document Translation

CzechDocs is a multiway parallel dataset of formatted documents in HTML, DOCX, and PDF formats, covering Czech and minority languages such as Ukrainian, English, Vietnamese, and Russian. It supports evaluation of machine translation systems that preserve document formatting, with a validation subset and evaluation toolkit publicly released. A held-out test split will be used for a future shared task on document-level translation with formatting preservation.

media r/LocalLLaMA · 6d ago

LFM2.5-Embedding-350M and LFM2.5-ColBERT-350M Released

LFM2.5-Embedding-350M is a dense bi-encoder that provides fast multilingual retrieval with one vector per document, achieving best-in-class accuracy for its size and inference speed comparable to smaller models. LFM2.5-ColBERT-350M is a late interaction retriever with best-in-class multilingual accuracy, enabling cross-lingual retrieval by storing one vector per token and supporting retrieval in multiple languages with high precision. Both models are designed as drop-in replacements for existing RAG pipelines.
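The difference between one vector per document and one vector per token can be sketched with toy vectors. The snippet below is a minimal illustration, assuming cosine similarity for the dense case and the standard ColBERT-style MaxSim sum for the late-interaction case; the function names are hypothetical and the exact scoring used by the LFM2.5 models is not stated in the summary.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def dense_score(query_vec, doc_vec):
    """Bi-encoder scoring: one vector per query, one vector per document."""
    return cosine(query_vec, doc_vec)

def late_interaction_score(query_token_vecs, doc_token_vecs):
    """MaxSim scoring: each query token keeps its best-matching document
    token, and the per-token maxima are summed (assumed scoring rule)."""
    return sum(
        max(cosine(q, d) for d in doc_token_vecs)
        for q in query_token_vecs
    )

# Toy example: a two-token query against a two-token document.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_tokens = [[1.0, 0.0], [0.0, 1.0]]
print(dense_score([1.0, 0.0], [1.0, 0.0]))
print(late_interaction_score(query_tokens, doc_tokens))
```

The trade-off this illustrates is storage versus granularity: the dense index holds a single vector per document, while the late-interaction index holds one per token and compares them at query time.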

media r/LocalLLaMA · 6d ago

The power of intelligence is better in the hands of the people than in the board rooms of tycoons

The PearlOS project has launched an open-source swarm intelligence platform that uses local models to handle multimodal tasks. It automatically selects and switches between top-performing models based on benchmarks, ensuring users always access the latest and most capable models without relying on closed-source systems or subscriptions.

media r/LocalLLaMA · 6d ago

Keye-VL-2.0-30B-A3B Launches with Advanced Video Understanding and Agent Capabilities

Keye-VL-2.0-30B-A3B is a 30B-parameter multimodal model designed for long-video understanding and agent functionality. It outperforms open-source rivals and matches Gemini-3-Flash in temporal grounding, supports up to 256K context with near-lossless reasoning, and includes built-in capabilities for code, tool, and web search agent workflows.

arxiv arXiv cs.LG · 7d ago

TGO-I: Spectral Geometry of Vision Transformers

TGO-I analyzes the spectral geometry of Vision Transformers using a ViT-Small/16 trained on ImageNet-100. It reports increasing dimensional utilization and reduced anisotropy, with eigenspectra becoming flatter and spectral entropy rising. The final CLS token shows the highest effective dimensionality and the lowest anisotropy, indicating that variance is spread broadly across dimensions.
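The quantities named in this summary can be computed from the eigenvalues of a feature covariance matrix. The sketch below uses common definitions that are assumptions here, not necessarily the paper's: spectral entropy as the Shannon entropy of the normalized eigenvalues, effective dimensionality as the exponential of that entropy, and anisotropy as the share of variance in the largest eigenvalue. The function name `spectral_stats` is hypothetical.

```python
import math

def spectral_stats(eigenvalues):
    """Return (entropy, effective_dim, anisotropy) for a list of
    non-negative covariance eigenvalues. Definitions are assumed."""
    total = sum(eigenvalues)
    # Normalize eigenvalues into a probability distribution, skipping zeros.
    probs = [ev / total for ev in eigenvalues if ev > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    effective_dim = math.exp(entropy)
    anisotropy = max(eigenvalues) / total
    return entropy, effective_dim, anisotropy

# A flat spectrum uses all dimensions equally; a peaked one does not.
flat = spectral_stats([1.0, 1.0, 1.0, 1.0])
peaked = spectral_stats([97.0, 1.0, 1.0, 1.0])
print(flat)
print(peaked)
```

Under these definitions a flatter eigenspectrum gives higher entropy, higher effective dimensionality, and lower anisotropy, which matches the direction of the trend described above.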

arxiv arXiv cs.LG · 7d ago

OneCanvas: 3D Scene Understanding via Panoramic Reprojection

OneCanvas enables 3D scene understanding in Vision-Language Models by aggregating patch features onto a single panoramic canvas using 3D world coordinates. It achieves state-of-the-art performance on SQA3D and VSI-Bench, and generalizes to out-of-distribution data on SPBench, using significantly less training compute than existing methods.
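One way to picture placing patch features on a panoramic canvas from 3D world coordinates is an equirectangular mapping. The sketch below is an illustration under stated assumptions: the summary does not specify the projection or the pooling rule, so the equirectangular layout, the per-cell averaging, and the names `to_canvas_cell` and `aggregate` are all hypothetical.

```python
import math
from collections import defaultdict

def to_canvas_cell(point, origin, width, height):
    """Map a 3D world point to a (u, v) cell on an equirectangular canvas
    centered at `origin` (assumed projection)."""
    dx, dy, dz = (p - o for p, o in zip(point, origin))
    azimuth = math.atan2(dy, dx)                    # in [-pi, pi]
    elevation = math.atan2(dz, math.hypot(dx, dy))  # in [-pi/2, pi/2]
    u = int((azimuth + math.pi) / (2 * math.pi) * width) % width
    v = min(int((math.pi / 2 - elevation) / math.pi * height), height - 1)
    return u, v

def aggregate(patches, origin, width, height):
    """Average the feature vectors of all patches that land in the same cell.
    `patches` is a list of (world_point, feature_vector) pairs."""
    sums = {}
    counts = defaultdict(int)
    for point, feature in patches:
        cell = to_canvas_cell(point, origin, width, height)
        if cell not in sums:
            sums[cell] = [0.0] * len(feature)
        sums[cell] = [s + f for s, f in zip(sums[cell], feature)]
        counts[cell] += 1
    return {cell: [s / counts[cell] for s in vec] for cell, vec in sums.items()}

# Two patches along the same viewing direction share one canvas cell.
patches = [((1.0, 0.0, 0.0), [1.0, 3.0]), ((2.0, 0.0, 0.0), [3.0, 5.0])]
canvas = aggregate(patches, origin=(0.0, 0.0, 0.0), width=8, height=4)
print(canvas)
```

Patches from different frames that observe the same direction end up in the same cell, so the canvas size stays fixed regardless of how many views are aggregated.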

arxiv arXiv cs.LG · 7d ago

Ambient Sound and Light Predict ICU Delirium

A study finds that ambient sound and light intensity can independently predict delirium in ICUs. Sound features were the dominant predictors, with combined sound and light improving short-term delirium risk estimation, especially within one week.

arxiv arXiv cs.AI · 7d ago

Clinician-Centered Pipeline for Ultrasound AI Annotation and Evaluation

A new pipeline enables clinicians to perform remote annotation and blinded evaluation of ultrasound AI models without local data downloads. It supports multi-rater participation, result aggregation, and automated statistical analysis, validated in a fetal ultrasound segmentation study with six raters of varying expertise. Results show moderate to strong agreement and a preference for later active learning models in blinded rankings.

arxiv arXiv cs.AI · 7d ago

Hardware-validated vision-in-the-loop for maritime UAV autonomy

A deep monocular pose estimator processes rendered maritime environments in real time, and its output is fused with IMU data via a delayed Kalman filter. The system achieves autonomous indoor flight under perception latency and computational constraints, allowing maritime UAV autonomy to be validated safely before shipboard deployment.
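Handling a vision measurement that arrives late can be sketched in one dimension. The snippet below is a minimal illustration, assuming a rewind-and-replay scheme: the filter stores past states, applies the delayed measurement at its original time stamp, and re-applies the stored IMU-derived displacements. This is one common approach and not necessarily the paper's; the class name and parameters are hypothetical.

```python
class DelayedKalman1D:
    """Scalar Kalman filter that accepts measurements stamped in the past."""

    def __init__(self, x0, p0, process_var, meas_var):
        self.x, self.p = x0, p0
        self.q, self.r = process_var, meas_var
        self.step = 0
        self.history = {0: (x0, p0)}  # step -> (state, variance) at that step
        self.inputs = {}              # step -> displacement from step to step + 1

    def predict(self, displacement):
        """Propagate with an IMU-derived displacement and record the result."""
        self.inputs[self.step] = displacement
        self.x += displacement
        self.p += self.q
        self.step += 1
        self.history[self.step] = (self.x, self.p)

    def update_delayed(self, z, stamp):
        """Apply measurement z taken at step `stamp`, then replay the inputs."""
        x, p = self.history[stamp]
        gain = p / (p + self.r)
        x += gain * (z - x)
        p *= (1.0 - gain)
        self.history[stamp] = (x, p)
        for s in range(stamp, self.step):
            x += self.inputs[s]
            p += self.q
            self.history[s + 1] = (x, p)
        self.x, self.p = x, p

# Three IMU steps, then a vision fix that refers back to step 1.
kf = DelayedKalman1D(x0=0.0, p0=1.0, process_var=0.0, meas_var=1.0)
for _ in range(3):
    kf.predict(1.0)
kf.update_delayed(z=2.0, stamp=1)
print(kf.x, kf.p)
```

A real implementation would bound the history buffer and use a full state vector; the sketch keeps everything scalar so the correction and replay steps are easy to follow.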

arxiv arXiv cs.AI · 7d ago

Multi-Domain Benchmark for Detecting AI-Generated Text-Rich Images

A new benchmark evaluates AI-generated text-rich images across six domains, including commercial posters and receipts. It reveals significant domain-dependent performance and sensitivity to JPEG compression, highlighting the need for text- and layout-aware detection methods.

arxiv arXiv cs.CL · 7d ago

OmniAgent: Native Active Perception for Omni-Modal Understanding

OmniAgent introduces a POMDP-based iterative Observation-Thought-Action cycle for video understanding, enabling on-demand action execution to selectively distill audio-visual cues into persistent textual memory. It achieves state-of-the-art performance on ten benchmarks, with a 7B agent outperforming a 10× larger Qwen2.5-VL-72B model on LVBench (50.5% vs. 47.3%).

arxiv arXiv cs.LG · 7d ago

Semantic Robustness Certification for Vision-Language Models

This work introduces a framework that certifies vision-language model robustness under semantic-level transformations, using text prompts as proxies. It quantifies extent intervals for which predictions remain unchanged, without requiring additional data for each variation. Experiments on synthetic and real-world data demonstrate its effectiveness across diverse semantic variations.

arxiv arXiv cs.LG · 7d ago

Inductive Biases in ML Emulation of Sudden Stratospheric Warmings

A study evaluates how architectural inductive biases affect machine learning emulators' ability to capture sudden stratospheric warming dynamics in idealized simulations. Results show that three-dimensional vertical coupling is a key bias, with model performance diverging significantly during active SSW-like variability. However, low forecast error does not ensure accurate wave-mean-flow interactions, as coherent errors persist in stratospheric wave-driving structure.