Google DeepMind — korshunov.ai

RubricsTree: Scalable Evaluation Framework for Personal Health Agents

RubricsTree introduces a hierarchical taxonomy of over 100 clinically verifiable Boolean rubrics, evolved from 4,000 real user queries via human-in-the-loop curation. It enables scalable, expert-aligned evaluation of personal health agents by dynamically routing queries to relevant rubrics, and it outperforms baseline methods in alignment and context sensitivity, with model performance gains of up to 66% on HealthBench.

arXiv cs.LG · 8d ago

Reversal Q-Learning: A New Off-Policy RL Algorithm

Reversal Q-Learning (RQL) is a new off-policy reinforcement learning algorithm that trains a flow policy using prior data. By modeling flow refinement steps as actions in an expanded Markov decision process and applying virtual on-policy trajectories via reversal, RQL enables effective offline learning without backpropagation through time. Experiments on 50 robotic tasks show RQL achieves the best average performance among state-of-the-art flow-based offline RL methods.

arXiv cs.LG · 8d ago

SkillMigrator: Transferable Interaction Patterns for Web Agent Efficiency

SkillMigrator learns reusable web skills by matching layout structures instead of element references. It stores each skill as a transferable interaction pattern with a structural sketch, enabling efficient skill transfer across sites. Compared to state-of-the-art methods, it reduces average LLM-action counts by 8-10% on WebArena and Mind2Web at matched success rates.

arXiv cs.LG · 8d ago

Vision-language models don't always need images for chest X-ray accuracy

A causal audit shows that many vision-language models achieve high chest radiograph accuracy without using images. Text-only models match multimodal models in performance and outperform them in grounding, and accuracy and confidence flags appear only when the image is actually used. These findings suggest that accuracy alone is insufficient to validate clinical deployment; grounding must also be assessed.

arXiv cs.LG · 8d ago

Lightweight Experiential Latent Memories for Continual Self-Improvement

A new method enables large language models to learn from their own reasoning traces without external supervision. By distilling inference-time computation into lightweight, modular latent memories, the model achieves performance competitive with full training and outperforms zero-shot and raw ICL baselines on mathematical reasoning tasks, with minimal computational overhead.

arXiv cs.AI · 8d ago

LegalHalluLens: Auditing Hallucinations in Legal AI

LegalHalluLens introduces a framework to audit AI hallucinations in legal contexts by analyzing typed hallucination profiles across four claim categories. It reveals a 38-40 point gap between obligation/numeric and temporal claims, and shows two systems with identical 52% hallucination rates can have opposite risk directions. The framework uses a Risk Direction Index and calibrated debate pipelines to reduce fabricated detections by 45% and improve accountability in legal AI deployment.

arXiv cs.AI · 8d ago

ProvenanceGuard: Source-Aware Factuality Verification for MCP-Based LLM Agents

ProvenanceGuard introduces a source-aware verifier for MCP-based LLM agents that detects cross-source conflation by routing claims to specific evidence sources and comparing stated attribution with actual source ownership. It achieves block F1 of 0.802 and source accuracy of 0.858 on 260 source-eligible claims, outperforming source-blind baselines, and detects all injected attribution swaps in 50 clinical probes.

arXiv cs.AI · 8d ago

WEQA: Wearable Health Question Answering with Query-Adaptive Agentic Reasoning

WEQA introduces a query-adaptive agent framework that combines language models with specialized wearable data analysis tools. It outperforms LLM and agentic baselines by 24% in accuracy and demonstrates improved usefulness and clinical soundness in expert and user evaluations.

arXiv cs.AI · 8d ago

LEADS: Agentic Discovery of Hybrid Models for Cardiac Electrophysiology

LEADS proposes a framework that uses an LLM agent to discover hybrid cardiac electrophysiology models through an iterative reasoning-and-action loop. It formulates domain knowledge as a structured action space, enabling physically grounded, interpretable, and numerically stable model designs, outperforming both human-designed and other LLM-based approaches on synthetic and real cardiac data.

arXiv cs.AI · 8d ago

RubricsTree: Scalable Evaluation Framework for Personal Health Agents

RubricsTree introduces a hierarchical taxonomy of over 100 clinically verifiable Boolean rubrics, evolved from 4,000 real user queries via human-in-the-loop curation. It enables scalable, expert-aligned evaluation of personal health agents by dynamically routing queries to relevant rubrics, and it outperforms baseline methods in alignment and context degradation detection, with model performance gains of up to 66% on HealthBench.

arXiv cs.CL · 8d ago

NarrativeWorldBench and N-VSSM for Long-Horizon Audio Drama

NarrativeWorldBench evaluates 21 LLMs on nine narrative-structure metrics across horizons of 10 to 200 episodes, with cross-lingual support in Hindi, Tamil, Telugu, and Marathi. N-VSSM, a latent world model using Mamba-2, achieves plot-beat F1 of at least 0.84 across all horizons with 4x lower compute than closed-frontier models and outperforms Claude Opus 4.5 in long-arc consistency and controllability in a professional writer study.

arXiv cs.CL · 8d ago

PARSE: Real-Document Defense for LLM Agents

PARSE reduces prompt injection attack success from 25.4% to 15.6% on real enterprise documents across five professional domains, with statistically significant improvement (p=0.014) and 86.9% utility. It outperforms paraphrasing and uses provenance-aware sanitization to preserve factual content while routing most documents through a lightweight path.

arXiv cs.CL · 8d ago

STATEWITNESS: Activation Explainer for Deception Auditing in LLMs

STATEWITNESS introduces an activation explainer that audits deception in reasoning LLMs by reading hidden states and generating natural-language answers or structured reports. It achieves a 0.916 mean AUROC, outperforming existing black-box monitors and activation probes by 11.6% and 25.0% respectively, and provides query-level, schema, and evidence-level traces for human inspection.

arXiv cs.CL · 8d ago

Routing Accuracy Degradation and Recovery in Enterprise Agent Systems

As enterprise agent tool catalogs scale from 10 to 110 agents, routing accuracy drops 16–23 percentage points on under-specified requests. An oracle analysis identifies retrieval and confusion gaps, with embedding-based shortlisting recovering +10–11 pp F1. A human-annotated study of 1,435 utterances confirms real-world recovery of +10–17 pp despite lower absolute performance.

arXiv cs.CL · 8d ago

LLM Features Can Hurt GNNs via Concatenation Interference

Concatenating LLM-generated features to graph neural networks systematically reduces accuracy on homophilous benchmarks, with PubMed accuracy dropping by 17.0 ± 0.3 pp. This degradation is linked to LLM-alone discriminability (Delta_sig), which correlates strongly with concatenation cost (r² = 0.38) and shows a power-law relationship with feature dimension and node count (r² = 0.97), particularly in low-Delta_sig, low-node scenarios.

arXiv cs.CL · 8d ago

SkillMigrator Enables Cross-Site Web Skill Transfer via Layout Matching

SkillMigrator learns reusable web skills by matching layout structures instead of specific element references. It stores each skill as a transferable interaction pattern (TIP) with a structural sketch, enabling efficient skill reuse across sites. Compared to state-of-the-art methods, it reduces average LLM-action counts by 8-10% on WebArena and Mind2Web at matched success rates.

arXiv cs.CL · 8d ago

MambaCount: Efficient Text-guided Object Counting

MambaCount introduces a spatial sparse state space duality block to enable efficient text-guided open-vocabulary object counting. It addresses causal modeling limitations and high entropy in spatial token responses, achieving state-of-the-art results on FSC-147 with a test MAE of 12.23 while maintaining linear complexity.

arXiv cs.CL · 8d ago

LLM-Designed Training Environment for RL with Multi-Agent Reasoning

The LLM-as-Environment-Engineer framework uses LLMs to automatically redesign training environments in reinforcement learning by analyzing failure trajectories and contextual data. On the MAPF-FrozenLake testbed, it outperforms larger proprietary LLMs and fixed-environment baselines, with Qwen3-4B achieving the strongest aggregate performance. Analysis shows that failure evidence and preserved working configurations are key, and the current RL checkpoint performs better than the base model as an environment engineer.

arXiv cs.CL · 8d ago

SuCo: Sufficiency-guided Continuous Adaptive Reasoning

SuCo defines Minimal Sufficient CoT (MSC) as the shortest reasoning prefix sufficient to reach a correct answer. It uses a two-stage training framework (MSC-Aligned Fine-Tuning and Sufficiency-Aware Policy Optimization) to reduce reasoning length while maintaining or improving accuracy across math, code, and science tasks.

arXiv cs.CL · 8d ago

Vision-language models don't always need images for chest X-ray accuracy

A causal audit shows that text-only models match multimodal models in chest radiography accuracy. Across nine systems, a text-only model performs within 5.7 points of the best multimodal model, and a 119-billion-parameter model is indistinguishable from a 7-billion-parameter text-only baseline. Grounding audits, not accuracy, should determine clinical deployment.