Mistral AI — korshunov.ai

ViGOS: Decoupling Perception and Reasoning in Multimodal On-Policy Self-Distillation

ViGOS introduces a visually grounded on-policy self-distillation framework for multimodal large language models. It decouples perception and reasoning by using an image-only teacher for visual descriptions and a reasoning teacher for final outputs, reducing reliance on text-only references. This approach improves image-grounded performance across multiple vision-language benchmarks.

arXiv cs.CL · 7d ago

HandwritingAgent: Language-Driven Handwriting Synthesis in SVG

HandwritingAgent synthesizes natural handwriting in SVG format without style-specific training. It uses a large reasoning model to generate stroke sequences in a grid canvas, conditioned on text input and a reference style image, enabling efficient, controllable, and generalizable handwriting generation.

arXiv cs.AI · 7d ago

Skill-Guided Continuation Distillation for GUI Agents

SGCD introduces an iterative framework to improve GUI agents by addressing supervision gaps in off-trajectory states. It extracts skills from both successful and failed rollouts and uses them to guide policy continuations, which are mixed with expert trajectories. On OSWorld-Verified, SGCD raises the success rates of three base models from the low 30% range to over 50%.

arXiv cs.AI · 7d ago

ThinkDeception: Interpretable Multimodal Deception Detection Framework

ThinkDeception introduces a progressive reinforcement learning framework that enables interpretable multimodal deception detection. It leverages a step-by-step annotated Chain of Thought dataset and proposes Visual-Audio Consistency Group Relative Policy Optimization with a dynamic curriculum, enhancing reasoning quality and outperforming existing methods on mainstream benchmarks.

arXiv cs.AI · 7d ago

AdsMind: Physics-Grounded Multi-Agent System for Adsorption Discovery

AdsMind is a closed-loop multi-agent system that uses machine learning force fields and feedback to correct errors in adsorption configuration searches on catalyst surfaces. It achieves success rates of 100% and 98.8% on the AA20 and OCD-GMAE62 benchmarks, respectively, reduces energy dispersion 14-fold compared to baselines, and maintains correct adsorption-energy signs in DFT validation, outperforming open-loop LLM agents.

arXiv cs.LG · 8d ago

LegalHalluLens: Auditing Hallucinations in Legal AI

LegalHalluLens introduces a framework to audit AI hallucinations in legal contexts by analyzing typed hallucination profiles across four claim categories. It reveals a 38-40 point gap between obligation/numeric and temporal claims, and shows that two systems with identical 52% hallucination rates can have opposite risk directions. The framework uses a Risk Direction Index and calibrated debate pipelines to reduce fabricated detections by 45%, offering actionable diagnostics for trustworthy legal AI deployment.

arXiv cs.CL · 8d ago

SkillWeaver: Compositional Skill Routing for LLM Agents

SkillWeaver introduces a decompose-retrieve-compose framework for LLM agents, formalizing the Compositional Skill Routing problem. Its Iterative Skill-Aware Decomposition (SAD) raises decomposition accuracy from 51.0% to 67.7% (p < 10^-6), and the framework reduces context window usage by over 99%.

arXiv cs.CL · 8d ago
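
The decompose-retrieve-compose pattern named in the SkillWeaver entry above can be illustrated with a toy sketch. This is not the paper's method: the skill library, the split on " then ", and the word-overlap retrieval are all hypothetical stand-ins chosen only to show the three stages.

```python
# Hypothetical skill library: name -> short description (not from the paper).
SKILLS = {
    "search_flights": "search flights between two cities on a date",
    "book_hotel": "book a hotel room in a city",
    "send_email": "send an email message to a recipient",
}


def decompose(task):
    """Split a task into sub-tasks. A real system would use an LLM here;
    splitting on ' then ' is an assumption made for illustration."""
    return [part.strip() for part in task.split(" then ") if part.strip()]


def retrieve(subtask, skills):
    """Pick the skill whose description has the highest Jaccard word overlap."""
    words = set(subtask.lower().split())

    def overlap(name):
        desc = set(skills[name].lower().split())
        return len(words & desc) / len(words | desc)

    return max(skills, key=overlap)


def compose(task, skills):
    """Return an ordered plan of (sub-task, skill name) pairs."""
    return [(sub, retrieve(sub, skills)) for sub in decompose(task)]


plan = compose(
    "search flights to Paris then book a hotel in Paris then send an email to Sam",
    SKILLS,
)
for subtask, skill in plan:
    print(f"{subtask!r} -> {skill}")
```

In a sketch like this, only the descriptions of the retrieved skills would need to enter the agent's prompt, not the whole library. That is one plausible reading of how routing can shrink context usage, not a claim about how the paper measures it.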

Geographic Bias in Large Language Models from User Metadata

A study finds that user metadata causes large language models to give region-specific responses even to neutral prompts. Location leakage increases up to 793-fold in some models, and replacing the location with 'Unknown' still causes significant bias, indicating that the user profile frame itself acts as a conditioning signal.

arXiv cs.CL · 8d ago

RubricsTree: Scalable Evaluation Framework for Personal Health Agents

RubricsTree introduces a hierarchical taxonomy of over 100 clinically verifiable Boolean rubrics, evolved from 4,000 real user queries via human-in-the-loop curation. It enables scalable, expert-aligned evaluation of personal health agents by dynamically routing queries to relevant rubrics. It outperforms baseline methods in alignment and context sensitivity, and yields model performance gains of up to 66% on HealthBench.

arXiv cs.LG · 8d ago

SkillMigrator: Transferable Interaction Patterns for Web Agent Efficiency

SkillMigrator learns reusable web skills by matching layout structures instead of element references. It stores each skill as a transferable interaction pattern with a structural sketch, enabling efficient skill transfer across sites. Compared to state-of-the-art methods, it reduces average LLM-action counts by 8-10% on WebArena and Mind2Web at matched success rates.

arXiv cs.AI · 8d ago

WEQA: Wearable Health Question Answering with Query-Adaptive Agentic Reasoning

WEQA introduces a query-adaptive agent framework that combines language models with specialized wearable data analysis tools. It outperforms LLM and agentic baselines by 24% in accuracy and demonstrates improved usefulness and clinical soundness in expert and user evaluations.

arXiv cs.AI · 8d ago

LEADS: Agentic Discovery of Hybrid Models for Cardiac Electrophysiology

LEADS proposes a framework that uses an LLM agent to discover hybrid cardiac electrophysiology models through an iterative reasoning-and-action loop. It formulates domain knowledge as a structured action space, enabling physically grounded, interpretable, and numerically stable model designs, outperforming both human-designed and other LLM-based approaches on synthetic and real cardiac data.

arXiv cs.AI · 8d ago

Visuals Lie, Consistency Speaks: Disentangling Spatial Attention from Reliability in Vision-Language Models

A study challenges the assumption that visual attention signals reliability in vision-language models. It finds near-zero correlation between spatial attention and accuracy, showing instead that self-consistency across reasoning paths is a stronger predictor of truth. Reliability is better explained by generation dynamics and internal state distributions, not visual attention patterns.

arXiv cs.CL · 8d ago
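
The self-consistency signal mentioned in the entry above can be sketched as follows. This is the generic majority-vote form of self-consistency, not the study's exact metric: the function name, the normalization step, and the sample answers are all illustrative assumptions.

```python
from collections import Counter


def self_consistency(answers):
    """Return (majority_answer, agreement) for final answers sampled from
    several independent reasoning paths. Agreement is the fraction of
    samples that match the majority answer after simple normalization."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    normalized = [a.strip().lower() for a in answers]
    majority, votes = Counter(normalized).most_common(1)[0]
    return majority, votes / len(normalized)


# Hypothetical final answers from five sampled reasoning paths.
samples = ["A red car", "a red car", "a blue car", "A red car ", "a red car"]
answer, agreement = self_consistency(samples)
print(answer, agreement)  # a red car 0.8
```

A low agreement score would flag an answer as less reliable without consulting attention maps at all, which is the kind of signal the summary contrasts with spatial attention.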

NarrativeWorldBench and N-VSSM for Long-Horizon Audio Drama

NarrativeWorldBench evaluates 21 LLMs on nine narrative-structure metrics across horizons of 10 to 200 episodes, with cross-lingual support in Hindi, Tamil, Telugu, and Marathi. N-VSSM, a latent world model using Mamba-2, achieves plot-beat F1 of at least 0.84 across all horizons with 4x lower compute than closed-frontier models and outperforms Claude Opus 4.5 in long-arc consistency and controllability in a professional writer study.

arXiv cs.CL · 8d ago

OPD-Evolver: On-Policy Distillation for Holistic Agent Evolving

OPD-Evolver introduces a slow-fast co-evolution framework that enables agents to select, act on, and reuse experience through on-policy self-distillation. It outperforms existing memory-based and training-based methods by up to 11.5% and 5.8%, respectively, and can challenge large-scale models such as Qwen3.5-397B-A17B and Step-3.5-Flash.

arXiv cs.CL · 8d ago

SkillMigrator Enables Cross-Site Web Skill Transfer via Layout Matching

SkillMigrator learns reusable web skills by matching layout structures instead of specific element references. It stores each skill as a transferable interaction pattern (TIP) with a structural sketch, enabling efficient skill reuse across sites. Compared to state-of-the-art methods, it reduces average LLM-action counts by 8-10% on WebArena and Mind2Web at matched success rates.

arXiv cs.CL · 8d ago

MambaCount: Efficient Text-guided Object Counting

MambaCount introduces a spatial sparse state space duality block to enable efficient text-guided open-vocabulary object counting. It addresses causal modeling limitations and high entropy in spatial token responses, achieving state-of-the-art results on FSC-147 with a test MAE of 12.23 while maintaining linear complexity.

arXiv cs.CL · 8d ago
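
For readers unfamiliar with the metric quoted in the MambaCount entry above, mean absolute error (MAE) in counting is the average absolute difference between predicted and true object counts per image. The sketch below uses made-up counts purely for illustration; they are not results from the paper.

```python
def mean_absolute_error(predicted, actual):
    """Average of |predicted - actual| over all images."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


# Hypothetical per-image counts (illustrative only).
predicted_counts = [10, 25, 7]
true_counts = [12, 20, 7]
mae = mean_absolute_error(predicted_counts, true_counts)
print(round(mae, 2))  # 2.33
```

Lower is better: an MAE of 12.23 means predictions are off by about 12 objects per image on average.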

Automated Prompt Optimization for LLM Game Agents

A new framework automates prompt refinement for LLM agents by splitting the observation-to-action pipeline into goal-conditioned and action selection modules. It uses an LLM-driven evolutionary loop to iteratively improve prompts based on environment feedback, achieving up to 72.5% success on PutNext where prior agents failed, without model fine-tuning.