AI agents — korshunov.ai

AI agents Page 1 / 20

Preference-Based Trajectory Evaluation for Agentic Systems

Offline evaluation of agentic systems often produces tied comparisons in 75% of cases using standard success-based metrics. Preference-based trajectory evaluation reduces ties to 35% by comparing progress and time-to-return profiles, enhancing discriminative power and data efficiency. These results suggest benchmark saturation may stem from evaluation method choice, not just data or problem difficulty.

arxiv arXiv cs.LG · 8d ago

SkillMigrator: Transferable Interaction Patterns for Web Agent Efficiency

SkillMigrator learns reusable web skills by matching layout structures instead of element references. It stores each skill as a transferable interaction pattern with a structural sketch, enabling efficient skill transfer across sites. Compared to state-of-the-art methods, it reduces average LLM-action counts by 8-10% on WebArena and Mind2Web at matched success rates.

arxiv arXiv cs.LG · 8d ago

EnvRL: Leveraging Environment Dynamics in Agentic RL

EnvRL introduces a framework that enhances agentic reinforcement learning by incorporating environment dynamics through state prediction and inverse dynamics objectives. When trained with GRPO, EnvRL improves success rates of Qwen-2.5-1.5B-Instruct from 72.8% to 77.4% on ALFWorld and from 56.8% to 67.0% on WebShop.

arxiv arXiv cs.LG · 8d ago

QueryMarket: Cost-Aware Online Active Learning in Data Markets

QueryMarket introduces OVBAL, an online variance-based active learning framework that estimates each data point's marginal utility using a D-optimality criterion with exponential forgetting. OVBAL selects samples based on utility and price, operating under rolling budget constraints and adapting to concept drift, showing improved error-cost trade-offs in solar power forecasting tasks.

arxiv arXiv cs.LG · 8d ago

Qwen-RobotManip Achieves Generalization in Robotic Manipulation

Qwen-RobotManip, a Vision-Language-Action foundation model, enables large-scale training through unified alignment across representation, motion, and behavior. It uses open-source data to build a 38,100-hour pretraining corpus and demonstrates emergent generalization, outperforming prior state-of-the-art models in out-of-distribution settings and ranking first in RoboChallenge with a 20% relative improvement on real-robot platforms.

arxiv arXiv cs.LG · 8d ago

WallZero Beats Go Pros in WallGo

WallZero, an AlphaZero-based agent, defeats two professional Go players in WallGo, averaging 1.98x more territory per game. The study finds that the opening from the Netflix series creates a more balanced game, suggesting improved fairness in play.

arxiv arXiv cs.LG · 8d ago

C2FL: Clustered Continual Federated Learning under Spatial and Temporal Drift

C2FL is a distributed federated learning approach that enables nodes to self-organize into spatial clusters based on geographic proximity. It addresses temporal drift by combining experience replay with dwell-time-aware adaptive averaging, allowing nodes to maintain updated, region-specific knowledge while adapting to evolving environmental conditions.

arxiv arXiv cs.AI · 8d ago

T-API-Compliant ReAct Loop for Optical Networks

A T-API-compliant ReAct agentic loop is introduced for optical networks, enabling intent-driven, closed-loop management. Domain-specific composite tools achieve 90% oracle-validated correctness and reduce token usage by threefold compared to generic tools.

arxiv arXiv cs.AI · 8d ago

C2FL: Clustered Continual Federated Learning under Spatial and Temporal Drift

arxiv arXiv cs.AI · 8d ago

LLM Consumer Behavior Theory: A New Research Field

This paper introduces LLM Consumer Behavior Theory, a new field analyzing how large language models make consumption decisions on behalf of users. It unifies research on LLM decision-making, human behavior simulation, and preference elicitation under economic principles, identifying key gaps in assumptions like rationality and heterogeneity in agentic markets.

arxiv arXiv cs.AI · 8d ago

LegalHalluLens: Auditing Hallucinations in Legal AI

LegalHalluLens introduces a framework to audit AI hallucinations in legal contexts by analyzing typed hallucination profiles across four claim categories. It reveals a 38-40 point gap between obligation/numeric and temporal claims, and shows two systems with identical 52% hallucination rates can have opposite risk directions. The framework uses a Risk Direction Index and calibrated debate pipelines to reduce fabricated detections by 45% and improve accountability in legal AI deployment.

arxiv arXiv cs.AI · 8d ago

ProvenanceGuard: Source-Aware Factuality Verification for MCP-Based LLM Agents

ProvenanceGuard introduces a source-aware verifier for MCP-based LLM agents that detects cross-source conflation by routing claims to specific evidence sources and comparing stated attribution with actual source ownership. It achieves block F1 of 0.802 and source accuracy of 0.858 on 260 source-eligible claims, outperforming source-blind baselines, and detects all injected attribution swaps in 50 clinical probes.

arxiv arXiv cs.AI · 8d ago

AI's Synthetic Lived Experience in Caregiver Support

LLMs can generate peer-like responses that mimic personal narratives, creating a false impression of lived experience. Psycholinguistic analysis shows AI uses less first-person and past-focused language than human peers, and often fabricates experiential grounding. This reveals a narrative authenticity gap, requiring AI systems to distinguish supportive framing from fabricated lived experience.

arxiv arXiv cs.AI · 8d ago

PseudoBench: Benchmarking Agentic Auto-Research Resistance to Pseudoscience

PseudoBench evaluates agentic auto-research systems' ability to detect pseudoscientific claims. Testing seven state-of-the-art agents, it finds near-zero refusal rates and only 27.4% resistance to pseudoscientific narratives. Current systems often present pseudoscience in credible scientific language, highlighting a critical risk to scientific integrity.

arxiv arXiv cs.AI · 8d ago

Agentic AI Framework Reduces Diagnostic Errors in Healthcare

A multi-agent AI framework addresses premature diagnostic handoff and silent hallucinations in healthcare by enforcing structured clinical protocol completion and epistemic uncertainty quantification. Evaluations on 150 simulated cases show 49.3% diagnostic precision, an 11.3 percentage point improvement over baseline, with a statistically significant negative correlation between OLDCARTS completeness and diagnostic uncertainty.

arxiv arXiv cs.AI · 8d ago

EAGG: Embodiment-Aligned Grasp Generation via Geometry-Aware Graph Conditioning

EAGG introduces a grasp generator that aligns embodiment structure within a shared model using topology-aware graphs and geometry-aware tokens. It achieves 56.17% average grasp success on MultiGripperGrasp, matching specialized models within 1.10 percentage points and reducing median contact distance from 0.239 cm to 0.189 cm.

arxiv arXiv cs.AI · 8d ago

ALeRCE Launches Text-to-SQL System with LLMs

The ALeRCE astronomical database introduces a text-to-SQL system using large language models, enabling natural language queries to generate executable SQL. The system, evaluated on 110 NL/SQL pairs, uses a step-by-step framework that outperforms direct-inference baselines, with Claude Opus 4.6 achieving high precision on simple queries and among the best overall performance across evaluated models.

arxiv arXiv cs.AI · 8d ago

Handlebars Triple-Brace Injection Exploits Structural Role Delimiters

Handlebars' triple-brace interpolation fails to protect against structural role injection, as HTML escaping only neutralizes angle-bracket delimiters. It leaves colon and Markdown hash delimiters intact, enabling attackers to hijack model turns. The default escaping provides no protection for most delimiter families and cannot replace a structural separation of instruction and data.

arxiv arXiv cs.AI · 8d ago

Meta-Knowledge Reutilization in Reinforcement Learning

A new framework learns task-level knowledge on a simplified agent and transfers it to heterogeneous agents. It uses Bayesian non-parametric priors and a high-level policy to generate task guidance, with a semantic-magnitude interface and temporal adaptor to align meta-knowledge with embodiment-specific controllers. Experiments show 94.75% to 99.79% reduction in final-step tracking error and comparable performance using 23.8% of the interaction data of state-of-the-art methods.

arxiv arXiv cs.AI · 8d ago

TAC: First Agentic Benchmark for Animal Welfare in AI

TAC evaluates whether AI agents avoid animal exploitation in travel bookings. Seven frontier models all score below 64% chance level, with Claude Opus 4.7 at 53%. Adding a welfare-aware system prompt improves performance significantly, though models show no evidence of evaluation awareness in their responses.