Multimodal — korshunov.ai

ViGiL3D++ Enables Diverse Language Generation for 3D Visual Grounding

ViGiL3D++ introduces a scalable, scene-agnostic method that generates diverse visual grounding queries by combining constraint sampling in scene graphs with LLM-based language generation. It outperforms existing models on multiple 3D visual grounding benchmarks and reveals key limitations of current vision-language models.

arxiv arXiv cs.CL · 2d ago

IPA-Based Tokenization Improves Multilingual Language Model Performance

A new approach uses the International Phonetic Alphabet to create language-agnostic tokenizers for multilingual models. Training matched text and IPA subword tokenizers across 24 languages and 14 scripts shows IPA tokenizers enhance tokenization quality, particularly for non-Latin scripts, and generalize better to unseen languages and scripts.

arxiv arXiv cs.CL · 2d ago

Beaver: Agent Harness for Scientific Curation from Multimodal Sources

Beaver is an agent harness that extracts structured information from scientific papers by integrating multimodal evidence tooling, task scaffolding, and artifact-grounded autoresearch. It achieves 81.0 on the Gold-Referenced Attribute Score, outperforming frontier agents by over 23 points, with key gains on high-value attributes requiring cross-modal reasoning.

arxiv arXiv cs.CL · 2d ago

Dementia-Agents: Multi-Modal Multi-Agent System for Dementia Staging

Dementia-Agents introduces a clinically aligned multi-agent framework for real-world dementia staging and phenotyping. It improves diagnostic performance over monolithic models and prior systems, while maintaining domain-level interpretability, using data from 1,066 patients across two cognitive neurology services.

arxiv arXiv cs.CL · 2d ago

MedLayXPlain: Benchmarking Expert-Lay Gap in Medical Vision-Language Models

MedLayXPlain introduces the first large-scale benchmark for medical lay language generation, featuring 122,789 region-grounded samples across eight imaging modalities. It evaluates medical vision-language models on expert-lay alignment using a hierarchical ontology system and a lightweight evaluator, revealing a systematic gap: expert-level performance in captioning coexists with significant degradation in lay language, while general-purpose models lack clinical precision.

arxiv arXiv cs.CL · 2d ago

CAT-Translate: Compact Japanese-English Models Outperform Multilingual Ones in Real-World Tasks

CAT-Translate introduces a family of small, open-source models specialized for Japanese-English translation. Using synthetic parallel corpora and a two-stage fine-tuning approach, the models achieve superior performance on real-world benchmarks across business, legal, medical, financial, and patent domains, outperforming large multilingual models in practical applications.

media r/LocalLLaMA · 3d ago

Gemma4-12B-QAT Uncensored Balanced Released with 60% Speed Boost via MTP

The Gemma4-12B-QAT Uncensored Balanced model is now available, featuring a 60% speed improvement through multi-token-prediction (MTP) speculative decoding. It includes Q4_K_M quantization, vision support via mmproj, and stable generation with no looping or context drift, making it ideal for creative writing and emotional intelligence tasks.

media r/LocalLLaMA · 3d ago

Updated Vision Model Benchmark Results and Recommendations

A revised benchmark of local vision language models evaluates 23 models across 30 images with 3 tests each, totaling 2,070 tests and 60 to 70 inference hours. The top-performing model is Qwen3.6 27B (nothink) at Q4 with a 79.6 score, followed by Qwen3.5 4B (nothink) at Q4, and Qwen3-VL 8B at Q8. Key findings include thinking mode degrading vision performance, MoE models underperforming compared to dense models, and Q8 quantization not universally improving results.
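The reported test count can be checked with simple arithmetic. The sketch below uses only the figures from the summary above; the per-test timing is a derived average, assuming the 60 to 70 hours cover all tests evenly, which the post does not state.

```python
# Figures taken from the benchmark summary.
models = 23
images = 30
tests_per_image = 3

# Every model runs every test on every image.
total_tests = models * images * tests_per_image

# Reported range of total inference time, in hours.
hours_low, hours_high = 60, 70

# Implied average time per test in seconds (assumes an even spread).
seconds_per_test_low = hours_low * 3600 / total_tests
seconds_per_test_high = hours_high * 3600 / total_tests

print(total_tests)  # 2070
print(round(seconds_per_test_low), round(seconds_per_test_high))  # 104 122
```

The product matches the stated 2,070 tests, and the implied average is roughly two minutes per test.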

lab NVIDIA Technical Blog · 4d ago

NVIDIA Launches XR AI for AR Glasses and Wearable Devices

NVIDIA introduces XR AI to bridge the infrastructure gap for developers building AI experiences on AR glasses and XR devices. The solution enables integration of live sensor streams, multimodal AI models, and enterprise data within device-specific runtimes, streamlining AI agent development for wearables.

media r/LocalLLaMA · 4d ago

AllenAI releases MolmoMotion vision models for future motion prediction

AllenAI has released two MolmoMotion models that predict 3D point trajectories based on short video histories and natural-language instructions. One model uses a three-frame history, the other a one-frame history, enabling future motion forecasting for objects in 3D space.

media r/LocalLLaMA · 4d ago

SupraLabs Launches Any2Any Model Family

SupraLabs has introduced the Supra-A2A-Nano-Exp model, a 30M-parameter multimodal Transformer that unifies text, image, and video into a single token stream. The model treats all modalities as tokens in a shared sequence, enabling language modeling over a combined vocabulary of 50,520 tokens without separate vision encoders or cross-attention modules.
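One common way to place several modalities in a single token stream is to give each modality its own ID range inside one shared vocabulary. The sketch below illustrates that idea only; the codebook sizes, the names `VOCAB_SIZES` and `to_shared_stream`, and the offset layout are hypothetical and are not taken from the SupraLabs release.

```python
from typing import Dict, List, Tuple

# Hypothetical per-modality codebook sizes, chosen only for illustration.
VOCAB_SIZES: Dict[str, int] = {"text": 100, "image": 16, "video": 16}


def build_offsets(sizes: Dict[str, int]) -> Tuple[Dict[str, int], int]:
    """Assign each modality a contiguous ID range in the shared vocabulary."""
    offsets: Dict[str, int] = {}
    start = 0
    for name, size in sizes.items():
        offsets[name] = start
        start += size
    return offsets, start


OFFSETS, TOTAL_VOCAB = build_offsets(VOCAB_SIZES)


def to_shared_stream(segments: List[Tuple[str, List[int]]]) -> List[int]:
    """Flatten (modality, local_ids) segments into one sequence of shared IDs."""
    stream: List[int] = []
    for modality, ids in segments:
        size = VOCAB_SIZES[modality]
        for local_id in ids:
            if not 0 <= local_id < size:
                raise ValueError(f"{modality} id {local_id} out of range")
            stream.append(OFFSETS[modality] + local_id)
    return stream


stream = to_shared_stream([("text", [5, 7]), ("image", [0, 15]), ("video", [3])])
print(TOTAL_VOCAB, stream)  # 132 [5, 7, 100, 115, 119]
```

With a layout like this, a single embedding table and a single output softmax can cover every modality, which is what lets a plain language-modeling objective run over the whole sequence.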

media r/LocalLLaMA · 5d ago

Research Project: Injecting Natural-Language Tactical Intent into Multi-Agent Football Policies

A research project explores using natural-language tactical instructions from humans to guide autonomous AI agents in a football simulation. The system enables human coaches to issue high-level directives like 'press aggressively' or 'exploit the left side', which the AI agents then adapt to in real time within a dynamic, team-based environment.

media r/LocalLLaMA · 5d ago

SupraLabs Releases SupraVL-Nano-900k Vision-Language Model

SupraLabs has launched SupraVL-Nano-900k, a fully transparent, 900k-parameter vision-language model trained from scratch on Flickr8k. It features a CNN visual encoder, GPT-2-style decoder, and prefix concatenation fusion, with all components openly documented and designed for educational clarity.
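Prefix concatenation fusion generally means projecting visual features to the decoder's embedding width and placing them in front of the text token embeddings. The following is a minimal pure-Python sketch of that general idea, not the SupraVL-Nano code; the function names, shapes, and values are all hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def project(features: Matrix, weight: Matrix) -> Matrix:
    """Linearly map each visual feature (length d_in) to the decoder width.

    `weight` has shape d_in x d_model.
    """
    d_model = len(weight[0])
    return [
        [sum(f[i] * weight[i][j] for i in range(len(f))) for j in range(d_model)]
        for f in features
    ]


def fuse_prefix(visual: Matrix, text: Matrix) -> Matrix:
    """Prepend projected visual embeddings to the text embeddings."""
    if visual and text and len(visual[0]) != len(text[0]):
        raise ValueError("visual and text embeddings must share one width")
    return visual + text


# Two visual feature vectors (d_in = 3) and three text embeddings (d_model = 2).
visual_features = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

fused = fuse_prefix(project(visual_features, weight), text_embeddings)
print(len(fused), fused[:2])  # 5 [[3.0, 2.0], [1.0, 2.0]]
```

The decoder then attends over the fused sequence as if the visual prefix were ordinary preceding tokens, so no cross-attention module is needed.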

media r/LocalLLaMA · 5d ago

Commission selects EUROPA consortium as winner of Frontier AI Grand Challenge

The European Commission has chosen the EUROPA consortium, led by Domyn, to develop an open-source frontier AI model in all 24 EU languages. The project, launched in February 2026, aims to create a model with over 400 billion parameters, showcasing Europe's capacity to build advanced AI on its own infrastructure.

arxiv arXiv cs.AI · 6d ago

SARLO-80: VHR SAR-Optical-Text Dataset Released

SARLO-80 is a large-scale dataset combining very-high-resolution SAR SLC, aligned optical imagery, and natural-language descriptions. It includes 119,566 triplets from 2,500 global scenes across 72 countries, standardized to an 80 cm slant-range grid with pixel-level alignment and three caption variants. The dataset is publicly available on Hugging Face for multimodal learning benchmarks in native SAR geometry.

arxiv arXiv cs.LG · 6d ago

FedMGS: Federated Modality-aware Graph Synthesis for Imbalanced MultiModal Learning

FedMGS addresses client- and node-level modality imbalance in federated graph learning by synthesizing latent semantic representations. It integrates an availability-aware graph encoder, prototype-guided semantic synthesizer, and reliability-calibrated fusion mechanism to recover missing modalities while preserving semantic alignment. Experiments show FedMGS achieves up to 17.41% performance gains over baselines across four tasks.

arxiv arXiv cs.LG · 6d ago

RefRad2D Dataset Enables Scalable Spatial Grounding in Radiology

RefRad2D is a large-scale bilingual dataset of 1.2M CT and MR image-text pairs from clinical practice. Trained on this data, RadGrounder achieves competitive results in VQA and report generation, and its spatial grounding supervision does not degrade language quality.

arxiv arXiv cs.LG · 6d ago

UNIEGO: Proxy-Mediated Unified Egocentric Video Representation

UNIEGO introduces a hierarchical multi-teacher distillation framework that uses proxy models to mediate knowledge transfer from nine diverse teachers across viewpoints and modalities. The Selective Proxy Distillation (SPD) stage adaptively selects reliable proxies during training, improving representation quality and stability. UNIEGO achieves state-of-the-art results in action recognition, video retrieval, and action segmentation on ego-exo benchmarks.

arxiv arXiv cs.CL · 6d ago

RefRad2D Dataset Enables Scalable Spatial Grounding in Radiology

RefRad2D is a large-scale bilingual dataset of 1.2M CT and MR image-text pairs from clinical practice. Trained on this data, RadGrounder achieves competitive VQA results and performs spatial grounding without degrading language quality, enabling verifiable outputs in radiology.

arxiv arXiv cs.CL · 6d ago

StylisticBias: Visual Cues Drive Most Social Biases in MLLMs

StylisticBias introduces a controlled benchmark to evaluate attribute-level social bias in multimodal large language models. It reveals that age and body type dominate identity-level effects, while fashion style and 15 key visual attributes drive most bias, accounting for nearly 80% of variation. The benchmark highlights that model judgments are most sensitive to appearance-related cues, especially in socioeconomic and style-based contexts.