AI agents — korshunov.ai


Claude v2.1.178 introduces new permission rules using Tool(param:value) syntax, improved workflow and skill loading in nested directories, and enhanced auto mode and error messaging. It fixes critical issues including crashes, authentication errors, and UI behavior in Chrome and VSCode, while refining tool prompts and undo functionality.

arXiv cs.AI · 8d ago

TAC: First Agentic Benchmark for Animal Welfare in AI

TAC evaluates whether AI agents avoid animal exploitation in travel bookings. Seven frontier models all score below 64% chance level, with Claude Opus 4.7 at 53%. Adding a welfare-aware system prompt improves performance significantly, though models show no evidence of evaluation awareness in their responses.

arXiv cs.AI · 8d ago

WEQA: Wearable Health Question Answering with Query-Adaptive Agentic Reasoning

WEQA introduces a query-adaptive agent framework that combines language models with specialized wearable data analysis tools. It outperforms LLM and agentic baselines by 24% in accuracy and demonstrates improved usefulness and clinical soundness in expert and user evaluations.

arXiv cs.AI · 8d ago

LEADS: Agentic Discovery of Hybrid Models for Cardiac Electrophysiology

LEADS proposes a framework that uses an LLM agent to discover hybrid cardiac electrophysiology models through an iterative reasoning-and-action loop. It formulates domain knowledge as a structured action space, enabling physically grounded, interpretable, and numerically stable model designs, outperforming both human-designed and other LLM-based approaches on synthetic and real cardiac data.

arXiv cs.AI · 8d ago

Red-Team Study Finds Frontier LLMs Remain Vulnerable to Adaptive Attacks

A red-team study of Anthropic's Fable 5 and Opus 4.8 models finds both vulnerable to adaptive iterative attacks, with Opus 4.8 breached on 11.5% of harmful intents and Fable 5 on 6.1%. Despite robust defenses, the automated attacks elicited 1,620 and 702 panel-confirmed harmful completions from the two models, spanning all harm categories.

arXiv cs.AI · 8d ago

Visual Verification Enables Inference-time Steering and Autonomous Policy Improvement

VERITAS introduces a generator-verifier framework that enables robots to improve policies in real time without additional training. A visual verifier evaluates actions at inference time, allowing consistent performance gains through verified rollouts that serve as effective supervision for offline policy improvement. Post-training with these verified rollouts matches expert demonstrations in efficiency, without human intervention.
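The generator-verifier idea in the summary above can be illustrated with a generic best-of-N selection loop. This is only a toy sketch, not the VERITAS method: the one-dimensional state, the random `generate_candidates` policy, and the distance-based `verifier_score` are all hypothetical stand-ins for a robot policy and a visual verifier.

```python
import random

def generate_candidates(state, n, rng):
    """Hypothetical generator: propose n candidate actions near the current state."""
    return [state + rng.uniform(-1.0, 1.0) for _ in range(n)]

def verifier_score(action, goal):
    """Hypothetical verifier: higher is better, here the negative distance to the goal."""
    return -abs(goal - action)

def steer(state, goal, n=8, seed=0):
    """Pick the candidate the verifier rates highest; no policy weights are updated."""
    rng = random.Random(seed)
    candidates = generate_candidates(state, n, rng)
    best = max(candidates, key=lambda a: verifier_score(a, goal))
    return best, candidates

chosen, proposed = steer(state=0.0, goal=0.5)
```

Under these assumptions, the (state, chosen action) pairs kept by the verifier could later be logged as supervision data, which is the role the summary attributes to verified rollouts.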

arXiv cs.CL · 8d ago

NarrativeWorldBench and N-VSSM for Long-Horizon Audio Drama

NarrativeWorldBench evaluates 21 LLMs on nine narrative-structure metrics across horizons of 10 to 200 episodes, with cross-lingual support in Hindi, Tamil, Telugu, and Marathi. N-VSSM, a latent world model using Mamba-2, achieves plot-beat F1 of at least 0.84 across all horizons with 4x lower compute than closed-frontier models and outperforms Claude Opus 4.5 in long-arc consistency and controllability in a professional writer study.

arXiv cs.CL · 8d ago

PARSE: Real-Document Defense for LLM Agents

PARSE reduces prompt injection attack success from 25.4% to 15.6% on real enterprise documents across five professional domains, with statistically significant improvement (p=0.014) and 86.9% utility. It outperforms paraphrasing and uses provenance-aware sanitization to preserve factual content while routing most documents through a lightweight path.

arXiv cs.CL · 8d ago

Routing Accuracy Degradation and Recovery in Enterprise Agent Systems

As enterprise agent tool catalogs scale from 10 to 110 agents, routing accuracy drops 16-23 percentage points on under-specified requests. An oracle analysis identifies retrieval and confusion gaps, with embedding-based shortlisting recovering +10-11 pp F1. A human-annotated study of 1,435 utterances confirms real-world recovery of +10-17 pp despite lower absolute performance.
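A minimal sketch of what embedding-based shortlisting can look like: rank the catalog by similarity to the request and pass only the top k agents to the router. Everything here is an assumption for illustration. The bag-of-words `embed` function stands in for a real embedding model, and the catalog entries and agent names are hypothetical, not taken from the paper.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def shortlist(query, catalog, k=3):
    """Return the k agent names whose descriptions are most similar to the query."""
    q = embed(query)
    ranked = sorted(catalog.items(),
                    key=lambda item: cosine(q, embed(item[1])),
                    reverse=True)
    return [name for name, _ in ranked[:k]]

# Hypothetical catalog: agent name -> short description.
CATALOG = {
    "expense_agent": "submit and approve travel expense reports",
    "calendar_agent": "schedule meetings and manage calendar invites",
    "hr_agent": "answer questions about leave policy and benefits",
    "it_agent": "reset passwords and fix laptop issues",
}

candidates = shortlist("approve my travel expense", CATALOG, k=2)
```

The router then chooses among `candidates` rather than the full catalog, which is one way to keep the decision space small as the catalog grows.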

arXiv cs.CL · 8d ago

OPD-Evolver: On-Policy Distillation for Holistic Agent Evolving

OPD-Evolver introduces a slow-fast co-evolution framework that enables agents to select, act on, and reuse experience through on-policy self-distillation. It outperforms existing memory-based and training-based methods by up to 11.5% and 5.8% respectively, and is shown to be competitive with large-scale models such as Qwen3.5-397B-A17B and Step-3.5-Flash.

arXiv cs.CL · 8d ago

SkillMigrator Enables Cross-Site Web Skill Transfer via Layout Matching

SkillMigrator learns reusable web skills by matching layout structures instead of specific element references. It stores each skill as a transferable interaction pattern (TIP) with a structural sketch, enabling efficient skill reuse across sites. Compared to state-of-the-art methods, it reduces average LLM-action counts by 8-10% on WebArena and Mind2Web at matched success rates.

arXiv cs.CL · 8d ago

EnvRL: Leveraging Environment Dynamics in Agentic RL

EnvRL introduces a framework that enhances agentic reinforcement learning by incorporating environment dynamics through state prediction and inverse dynamics objectives. It achieves significant gains in success rates on long-horizon benchmarks, improving Qwen-2.5-1.5B-Instruct performance from 72.8% to 77.4% on ALFWorld and from 56.8% to 67.0% on WebShop when trained with GRPO.

arXiv cs.CL · 8d ago

LLM-Designed Training Environment for RL with Multi-Agent Reasoning

The LLM-as-Environment-Engineer framework uses LLMs to automatically redesign training environments in reinforcement learning by analyzing failure trajectories and contextual data. On the MAPF-FrozenLake testbed, it outperforms larger proprietary LLMs and fixed-environment baselines, with Qwen3-4B achieving the strongest aggregate performance. Analysis shows that failure evidence and preserved working configurations are key, and the current RL checkpoint performs better than the base model as an environment engineer.

arXiv cs.CL · 8d ago

Automated Prompt Optimization for LLM Game Agents

A new framework automates prompt refinement for LLM agents by splitting the observation-to-action pipeline into goal-conditioned and action selection modules. It uses an LLM-driven evolutionary loop to iteratively improve prompts based on environment feedback, achieving up to 72.5% success on PutNext where prior agents failed, without model fine-tuning.

arXiv cs.CL · 9d ago

LOGOS: A General-Purpose Generative Model for Natural Sciences

LOGOS is a unified generative language model that represents scientific objects and their interactions as token sequences in a shared grammar. It achieves consistent or superior performance across diverse natural science tasks, demonstrating the feasibility of a single model serving multiple domains. The model scales positively with parameter count, and its design suggests that AI for Science should align deeply with large language models through shared architectures and training.

arXiv cs.CL · 9d ago

TokenPilot: Cache-Efficient Context Management for LLM Agents

TokenPilot reduces inference costs by 61% to 87% in both isolated and continuous modes, outperforming prior systems in cost efficiency while maintaining competitive performance. It uses ingestion-aware compaction and lifecycle-aware eviction to preserve prompt cache continuity and minimize token footprints.

arXiv cs.CL · 9d ago

DeepRubric: Efficient RL for Deep Research Agents

DeepRubric introduces a data construction framework that builds query-rubric pairs by first defining verifiable evaluation targets through an evidence tree. It generates 9K supervision examples and trains an 8B model with GRPO, achieving performance comparable to state-of-the-art models using 13x fewer RL GPU-hours.

arXiv cs.CL · 9d ago

KVEraser: Efficient Localized Context Erasing in LLMs

KVEraser enables efficient localized context erasing in large language models by replacing only the KV cache states of an erased span with learned steering states. It achieves near-full-recomputation performance on in-domain tasks across 1K to 32K context lengths, with only a 24% latency increase, and outperforms other approximate methods in long-document QA with a 3-4x speedup over full recomputation.

arXiv cs.CL · 9d ago

MetaSyn: Benchmarking LLM Agents on Meta-Analysis Articles

MetaSyn introduces a dataset of 442 expert-curated meta-analyses from Nature Portfolio. It evaluates twelve LLM agent configurations and reveals a critical bottleneck in study screening, where no system recovers more than 52.7% of ground-truth included literature despite high retrieval recall.

arXiv cs.CL · 9d ago

ContextRL: Context-Aware RL for LLMs

ContextRL introduces an indirect auxiliary objective to improve long-horizon reasoning and multimodal performance in LLMs. It rewards models for selecting the context that supports a query-answer pair, using contrastive context data from coding agent trajectories and image-based visual questions. ContextRL achieves +2.2% and +1.8% gains over standard methods on long-horizon and visual QA benchmarks, with gains attributed to the selection objective, not data augmentation.

Claude v2.1.178 Release Notes
