AI agents — korshunov.ai

AI agents Page 1 / 20

ReproRepo: Scalable Reproducibility Audits with GitHub Issues

ReproRepo introduces a scalable framework using GitHub issues to evaluate ML paper reproducibility. It shows that LLM agents like Codex with GPT-5.5 identify at least one human-reported blocker in 90% of 1,149 ML papers, highlighting their ability to detect visible failures and semantic issues, though exact localization remains limited.

arxiv arXiv cs.CL · 8d ago

LegalHalluLens: Auditing Hallucinations in Legal AI

LegalHalluLens introduces a framework to audit AI hallucinations in legal contexts by analyzing typed hallucination profiles across four claim categories. It reveals a 38-40 point gap between obligation/numeric and temporal claims, and shows two systems with identical 52% hallucination rates can have opposite risk directions. The framework uses a Risk Direction Index and calibrated debate pipelines to reduce fabricated detections by 45% and improve accountability in legal AI deployment.

arxiv arXiv cs.CL · 8d ago

ProvenanceGuard: Source-Aware Factuality Verification for MCP-Based LLM Agents

ProvenanceGuard introduces a source-aware verifier for MCP-based LLM agents that detects cross-source conflation by routing claims to specific evidence sources and comparing stated attribution with actual source ownership. It achieves block F1 of 0.802 and source accuracy of 0.858 on 260 source-eligible claims, outperforming source-blind baselines, and detects all injected attribution swaps in 50 clinical probes.

arxiv arXiv cs.CL · 8d ago

SkillWeaver: Compositional Skill Routing for LLM Agents

SkillWeaver introduces a decompose-retrieve-compose framework for LLM agents, formalizing the Compositional Skill Routing problem. It achieves 67.7% decomposition accuracy via Iterative Skill-Aware Decomposition (SAD), improving from 51.0% with a p-value of less than 10^-6, and reduces context window usage by over 99%.

arxiv arXiv cs.CL · 8d ago

AI's Synthetic Lived Experience in Caregiver Support

LLMs can generate peer-like responses that mimic personal narratives, creating a false impression of lived experience. Psycholinguistic analysis shows human peers use more first-person and past-focused language than AI, and AI often fabricates experiential grounding without real experience. This synthetic lived experience paradox risks misleading caregivers, necessitating mechanisms to distinguish supportive framing from fabricated experience.

arxiv arXiv cs.CL · 8d ago

PseudoBench: Benchmarking Agentic Auto-Research Resistance to Pseudoscience

PseudoBench evaluates agentic auto-research systems' ability to detect pseudoscientific claims. Testing seven state-of-the-art agents, it finds near-zero refusal rates and only 27.4% resistance to pseudoscientific narratives, with stronger agents often using sophisticated scientific language to mask pseudoscience.

arxiv arXiv cs.CL · 8d ago

Handlebars Triple-Brace Injection Exploits Structural Role Delimiters

Handlebars' triple-brace interpolation fails to protect against structural role injection, as HTML escaping only neutralizes angle-bracket delimiters. It leaves colon and Markdown hash delimiters intact, enabling attackers to hijack model turns. The default escaping provides no protection for most role delimiter families and cannot replace a structural separation of instructions and data.

arxiv arXiv cs.CL · 8d ago

Agentic Benchmark Reveals AI Models Fail to Avoid Animal Exploitation

TAC, the first agentic benchmark for implicit animal welfare, tests AI agents' ability to avoid animal exploitation in travel booking scenarios. All seven frontier models score below 64%, with the best at 53%, and even minor prompt improvements yield only modest gains. An audit finds no signs of evaluation awareness, indicating performance gaps stem from lack of true welfare reasoning, not prompt recognition.

arxiv arXiv cs.CL · 8d ago

Red-Team Study Finds Frontier LLMs Remain Vulnerable to Automated Attacks

A red-team study of Anthropic's Fable 5 and Opus 4.8 models reveals both are vulnerable to adaptive iterative attacks, with Opus 4.8 breached on 11.5% of intents and Fable 5 on 6.1%. Despite robust defenses, both models generated 1,620 and 702 panel-confirmed harmful completions across all harm categories, automatically and efficiently under automated attack.

arxiv arXiv cs.CL · 8d ago

d-OPSD: On-policy Self-distillation for Diffusion LLMs

d-OPSD is the first on-policy self-distillation framework designed for diffusion LLMs. It uses self-generated answers as suffix conditioning and step-level supervision, enabling efficient post-training with only about 10% of RLVR's optimization steps while outperforming RLVR and SFT baselines on four reasoning benchmarks.

arxiv arXiv cs.CL · 8d ago

ReproRepo: Scaling Reproducibility Audits with GitHub Issues

ReproRepo introduces a scalable framework using GitHub issues to evaluate ML paper reproducibility. It shows that LLM agents like Codex with GPT-5.5 identify at least one semantically related blocker in 90% of paper-repository pairs without executing code.

arxiv arXiv cs.LG · 8d ago

Preference-Based Trajectory Evaluation for Agentic Systems

Offline evaluation of agentic systems often produces tied comparisons in 75% of cases using standard success-based metrics. Preference-based trajectory evaluation reduces ties to 35% by comparing progress and time-to-return profiles, enhancing discriminative power and data efficiency. These results suggest benchmark saturation may stem from evaluation method choice, not just data or problem difficulty.

arxiv arXiv cs.LG · 8d ago

SkillMigrator: Transferable Interaction Patterns for Web Agent Efficiency

SkillMigrator learns reusable web skills by matching layout structures instead of element references. It stores each skill as a transferable interaction pattern with a structural sketch, enabling efficient skill transfer across sites. Compared to state-of-the-art methods, it reduces average LLM-action counts by 8-10% on WebArena and Mind2Web at matched success rates.

arxiv arXiv cs.LG · 8d ago

EnvRL: Leveraging Environment Dynamics in Agentic RL

EnvRL introduces a framework that enhances agentic reinforcement learning by incorporating environment dynamics through state prediction and inverse dynamics objectives. When trained with GRPO, EnvRL improves success rates of Qwen-2.5-1.5B-Instruct from 72.8% to 77.4% on ALFWorld and from 56.8% to 67.0% on WebShop.

arxiv arXiv cs.LG · 8d ago

QueryMarket: Cost-Aware Online Active Learning in Data Markets

QueryMarket introduces OVBAL, an online variance-based active learning framework that estimates each data point's marginal utility using a D-optimality criterion with exponential forgetting. OVBAL selects samples based on utility and price, operating under rolling budget constraints and adapting to concept drift, showing improved error-cost trade-offs in solar power forecasting tasks.

arxiv arXiv cs.LG · 8d ago

Qwen-RobotManip Achieves Generalization in Robotic Manipulation

Qwen-RobotManip, a Vision-Language-Action foundation model, enables large-scale training through unified alignment across representation, motion, and behavior. It uses open-source data to build a 38,100-hour pretraining corpus and demonstrates emergent generalization, outperforming prior state-of-the-art models in out-of-distribution settings and ranking first in RoboChallenge with a 20% relative improvement on real-robot platforms.

arxiv arXiv cs.LG · 8d ago

WallZero Beats Go Pros in WallGo

WallZero, an AlphaZero-based agent, defeats two professional Go players in WallGo, averaging 1.98x more territory per game. The study finds that the opening from the Netflix series creates a more balanced game, suggesting improved fairness in play.

arxiv arXiv cs.LG · 8d ago

C2FL: Clustered Continual Federated Learning under Spatial and Temporal Drift

C2FL is a distributed federated learning approach that enables nodes to self-organize into spatial clusters based on geographic proximity. It addresses temporal drift by combining experience replay with dwell-time-aware adaptive averaging, allowing nodes to maintain updated, region-specific knowledge while adapting to evolving environmental conditions.

arxiv arXiv cs.AI · 8d ago

T-API-Compliant ReAct Loop for Optical Networks

A T-API-compliant ReAct agentic loop is introduced for optical networks, enabling intent-driven, closed-loop management. Domain-specific composite tools achieve 90% oracle-validated correctness and reduce token usage by threefold compared to generic tools.

arxiv arXiv cs.AI · 8d ago