AI agents — korshunov.ai


Argus Benchmark Evaluates Uncertainty Quantification Stability Across Vision-Language Models and GUI Grounding Datasets

The authors introduce Argus, a benchmark for evaluating post-hoc uncertainty quantification (UQ) in computer-use agents, which translate vision-language model (VLM) predictions into executable GUI actions. The study assesses 28 open-weight methods across four VLM agents and four datasets, alongside eight closed-source methods from three vendors whose internal model states are inaccessible. The key finding is selective transfer stability: uncertainty rankings remain consistent across datasets for a fixed model but degrade significantly across model classes or observable interfaces. Among open-weight options, hidden-state and density-estimation techniques were the most stable, while specific regimes favored sampling-based scores or verbalized self-assessment. Within-model ranking transfer was strong, with Spearman rho values up to 0.969, whereas cross-tier transfer to closed-source vendors averaged only +0.08. Calibration shrinks the radii of conformal click regions by 40-60 percent, but coverage degrades under interface mismatch. To support regime-aware selection, the authors release per-item records, calibration splits, UQ scores, and analysis scripts.
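The ranking-transfer figures above are Spearman rank correlations. As a minimal sketch of how such a number could be computed, the snippet below ranks a set of UQ methods by a quality score on two datasets and correlates the ranks. The function names and the four example scores are hypothetical and are not taken from the benchmark, and the closed-form formula used here assumes no tied scores.

```python
def ranks(scores):
    """Return 1-based ranks, with the highest score ranked 1. Assumes no ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman_rho(a, b):
    """Spearman rank correlation via 1 - 6*sum(d^2) / (n*(n^2 - 1)), no ties."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    n = len(a)
    d_squared = sum((x - y) ** 2 for x, y in zip(ranks(a), ranks(b)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical quality scores for four UQ methods on two datasets.
scores_dataset_a = [0.81, 0.74, 0.69, 0.55]
scores_dataset_b = [0.78, 0.70, 0.72, 0.50]

# Methods 2 and 3 swap places between the datasets, so rho is about 0.8.
rho = spearman_rho(scores_dataset_a, scores_dataset_b)
print(f"rank transfer rho = {rho:.3f}")
```

A value near 1 means the method ordering carries over from one dataset to the other; a value near 0, like the cross-tier average reported above, means the ordering on one side says little about the other.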

arxiv arXiv cs.CL · 19h ago

ToolBench-X: Benchmarking Tool-Using Agents Under Unreliable Environments

The authors introduce ToolBench-X, a new benchmark designed to evaluate large language model agents under recoverable tool-environment unreliability. Unlike existing benchmarks that assume clean and stable environments, this framework injects five structured hazard types: Specification Drift, Invocation Error, Execution Failure, Output Drift, and Cross-source Conflict. The dataset contains executable multi-step tasks across diverse domains with deterministic tools and canonical final answers for automatic evaluation. Crucially, every injected instance remains solvable through valid recovery paths such as retrying, fallback, or verification. Experiments reveal a substantial reliability gap where agents performing well with reliable tools often fail under these hazards. Further analysis indicates that failures stem from limited hazard diagnosis and ineffective recovery rather than tool-use volume or inference budget. Targeted recovery hints successfully recover many failed tasks, whereas test-time scaling yields more limited gains. These findings suggest that evaluation must shift focus from function-call accuracy to task completion in unreliable environments.
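The recovery paths named above (retrying, fallback, verification) can be combined in a single wrapper. The sketch below is one possible arrangement, not the benchmark's harness: `TransientToolError`, `call_with_recovery`, and the two lookup tools are hypothetical names invented for illustration.

```python
import time


class TransientToolError(Exception):
    """Hypothetical error type for a failure that may succeed on retry."""


def call_with_recovery(primary, fallback, verify, args, max_retries=2, delay_s=0.0):
    """Try the primary tool with retries, then the fallback; verify every result."""
    last_error = None
    for tool in (primary, fallback):
        for _attempt in range(max_retries + 1):
            try:
                result = tool(*args)
            except TransientToolError as exc:
                last_error = exc
                time.sleep(delay_s)  # simple fixed pause before retrying
                continue
            if verify(result):
                return result
            # The call succeeded but the output failed verification, so
            # retrying the same tool is unlikely to help: move to the next tool.
            last_error = ValueError(f"verification failed for {result!r}")
            break
    raise RuntimeError("all recovery paths exhausted") from last_error


# Hypothetical tools: the primary times out once, then answers correctly.
calls = {"n": 0}


def flaky_lookup(key):
    calls["n"] += 1
    if calls["n"] == 1:
        raise TransientToolError("timeout")
    return {"key": key, "value": 42}


def backup_lookup(key):
    return {"key": key, "value": 42}


result = call_with_recovery(
    flaky_lookup, backup_lookup, lambda r: "value" in r, ("price",)
)
print(result, "after", calls["n"], "primary calls")
```

The design choice worth noting is that the wrapper treats an exception and a bad output differently: the first triggers a retry, the second triggers a fallback. That distinction mirrors the summary's point that diagnosing the hazard matters more than the number of tool calls.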

media r/LocalLLaMA · 21h ago

Colony: An Educational Simulation of LLM Attention Mechanisms Using Agent-Based Analogies

Colony is an educational resource designed to explain the attention mechanism of Large Language Models through simple analogies involving agents. The simulation places these agents within a board environment inspired by Conway's Game of Life. Each agent in the system represents a specific role within the self-attention block mechanism of an LLM. This visual approach allows users to observe how information flows and interacts during the attention process. The project is available as an open-source tool for those interested in exploring these concepts without complex mathematics. It serves as a fun and accessible way to understand the internal workings of transformer models.

lab Claude Code Releases · 22h ago

Claude Code v2.1.191 Release Notes

Claude Code version 2.1.191 introduces /rewind support, allowing users to resume conversations from before a /clear command was executed. The update fixes several critical issues, including background agents resurrecting after being stopped and scroll position jumping during streaming responses. It also corrects behavior where /voice displayed generic error messages and where /login URLs were truncated in Windows Terminal. Significant improvements enhance reliability for MCP servers by adding retry logic for transient network errors during capability discovery and OAuth flows. Headless environments now skip browser popups for OAuth, while sandbox network permissions are remembered for the session duration. Performance optimizations reduce CPU usage during streaming by approximately 37% through text update coalescing and mitigate long-session memory growth from the terminal output cache.

arxiv arXiv cs.AI · 1d ago

Hypothesis-Driven Skill Optimization for LLM Agents

HDSO (Hypothesis-Driven Skill Optimization) enables safe, auditable skill updates for LLM agents without training, using falsifiable hypotheses and validation. On ALFWorld, it improves Qwen3-8B by 6.9 average success-rate (SR) points and retains a 7.1-point gain under noisy feedback. Validated skills transfer across runs and models when diagnostic alignment is achieved.

lab Google DeepMind Blog · 1d ago

Gemini 3.5 Flash Adds Computer Use Capability

Google has introduced computer use in Gemini 3.5 Flash, enabling the model to execute code and interact with external tools. This feature allows users to run programming tasks and access real-time information through integrated computing functions.

arxiv arXiv cs.AI · 1d ago

MetaPS: Adaptive Strategy Selection for Market Agents

MetaPS is a simulation-guided framework that enables market agents to adaptively select among programmatic strategies based on market states. It uses simulated markets to generate supervised training data, then selects strategies during inference to produce executable actions. Experiments show MetaPS outperforms fixed strategies and LLM-based agents, with compact models exceeding stronger API models in performance.

arxiv arXiv cs.AI · 1d ago

PlanBench-XL: Benchmark for Long-Horizon Tool-Use Planning

PlanBench-XL evaluates long-horizon planning in LLM agents on 327 retail tasks spanning 1,665 tools. It introduces a blocking mechanism to simulate real-world tool failures, revealing that agents such as GPT-5.4 drop from 51.90% to 11.36% accuracy under severe disruptions, which highlights weaknesses in recovery and error handling.

arxiv arXiv cs.AI · 1d ago

Structural Codebase Index Improves Resolve Without Cost Penalty

A structural codebase index in coding agents enhances localization and resolve performance without increasing cost per cell. It outperforms agentic-grep baselines in both metrics and achieves lower cost per solved task, especially in workloads with multi-file changes.

arxiv arXiv cs.AI · 1d ago

Self-Evolving Cognitive Framework for Embodied Scientific Intelligence

The paper proposes a self-evolving cognitive framework that uses causal world modeling to enable embodied systems to continuously refine their internal models through interaction. It integrates causal modeling, intervention-driven reasoning, and continual refinement, redefining embodied interaction as an epistemic process for causal discovery and knowledge acquisition. The framework supports a shift from predictive to epistemic intelligence, with a new benchmark for evaluating self-evolving embodied scientific intelligence.

arxiv arXiv cs.AI · 1d ago

VADAOrchestra: Neurosymbolic Orchestration of Adaptive Reasoning Workflows

VADAOrchestra introduces a neurosymbolic framework that combines LLM-based workflow orchestration with Datalog+/- symbolic reasoning. It enables adaptive, explainable decision-making by incrementally planning workflows and executing logical inference on demand, offering verifiable traces, auditability, and scalability over large datasets.

arxiv arXiv cs.AI · 1d ago

SCOPE: Self-Adaptive Symbolic Planning for Open-Ended Environments

SCOPE introduces a framework that refines action plans and evolves symbolic world models in open-ended environments. It combines a Symbolic Execution Simulator and a Self-Adaptive Symbolic Memory to improve plan completeness, perturbation resilience, and cross-task adaptability.

arxiv arXiv cs.AI · 1d ago

LLM-Orchestrated Agent for SOI Directional Coupler Design

A large language model orchestrates the design of a silicon-on-insulator 2x2 directional coupler by proposing gap values and assessing convergence. The design is validated through eigenmode and FDTD simulations on a common 2D effective-index model, showing a consistent phase offset of 2.837(11) micrometers that is corrected in a closed-loop process. The final device achieves a 50/50 split with a cross fraction of 0.498, within 0.0017 of the target.

lab Mistral AI News · 1d ago

New Connector Controls for Enterprise Security and Access

Mistral Studio now offers enriched admin controls to govern connector access per workspace and tool, enabling fine-grained permissions. Features include API keys with scopes, multi-account connectors, and a new Connectors Debugger for root cause analysis, all supporting secure, auditable integration with enterprise systems.

media Hugging Face Forums · 1d ago

Aiden Mobile Agent Prototype in the Making

Aiden is a physical AI agent device that monitors a phone's screen via HDMI and controls it through USB HID, enabling app automation without jailbreak or installed software. It supports bring-your-own LLMs, operates without backend infrastructure or data collection, and is released under the AGPL license as an open-source development board.

arxiv arXiv cs.AI · 1d ago

Grounded Scaling: Determinism as a Core Limit in Agentic AI

Agentic AI performance degrades exponentially in non-deterministic environments, with k-step success falling as δ^k when per-step determinism δ < 1. The paper introduces a framework linking environment determinism to task success, verifiability, and skill evolution, proposing a Supply Certainty Index and a five-level Determinism Maturity Model. It challenges prevailing views by identifying determinism as a binding constraint across compute, data, embodiment, and alignment.
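The δ^k relationship above is easy to check numerically. The sketch below assumes, as the formula implies, that every step succeeds independently with the same per-step determinism δ. The function names and the example values (δ = 0.99, a 50% target) are illustrative and do not come from the paper.

```python
def k_step_success(delta, k):
    """Probability that all k steps succeed when each succeeds with probability delta."""
    if not 0.0 <= delta <= 1.0:
        raise ValueError("delta must lie in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    return delta ** k


def max_steps_for_target(delta, target):
    """Largest k with delta**k >= target, for 0 < delta < 1 and 0 < target <= 1."""
    if not (0.0 < delta < 1.0 and 0.0 < target <= 1.0):
        raise ValueError("requires 0 < delta < 1 and 0 < target <= 1")
    k = 0
    # Multiply step by step instead of using logarithms, which avoids
    # floating-point edge cases when delta**k lands exactly on the target.
    while k_step_success(delta, k + 1) >= target:
        k += 1
    return k


# With 99% per-step determinism, a 100-step task succeeds only about 37% of the time.
print(round(k_step_success(0.99, 100), 3))

# The longest task that still succeeds at least half the time is 68 steps.
print(max_steps_for_target(0.99, 0.5))
```

Even a per-step determinism that looks high leaves little headroom on long tasks, which is the sense in which the summary calls determinism a binding constraint.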

arxiv arXiv cs.AI · 1d ago

Gazer: Training-Free Semantic Correction for Autoregressive Visual Models

Gazer introduces a training-free framework that uses multimodal large language model feedback to correct semantic errors in real time during autoregressive visual model generation. By integrating reflective diagnosis and semantic correction stages, Gazer improves compositional accuracy and semantic alignment across multiple models without additional training.

arxiv arXiv cs.AI · 1d ago

MacAgentBench Launches macOS AI Agent Benchmark

MacAgentBench introduces a comprehensive benchmark with 676 tasks across 25 applications, 60% of which involve both GUI and CLI interactions. It uses deterministic rule-based evaluation and fine-grained multi-checkpoint scoring, revealing that Claude Opus 4.6 on OpenClaw achieves 73.7% Pass@1, primarily due to its skill library rather than framework design.

media r/LocalLLaMA · 1d ago

Nex-N2-Mini-Ultra-Uncensored-Heretic Model Released

The Nex-N2-Mini-Ultra-Uncensored-Heretic model is now available, featuring agentic thinking, 5/100 refusals, and a KL divergence (KLD) of 0.0020. It is released in both Safetensors and GGUF formats and is accessible via Hugging Face. The creator notes that Heretic 1.2.0 was chosen over 1.4.0 because it did better at avoiding high KLD while keeping refusals low.
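For readers unfamiliar with the KLD figure, the sketch below computes the Kullback-Leibler divergence between two discrete probability distributions. The post does not say how its number was measured, so treating KLD as a comparison between the original and modified models' next-token distributions is an assumption here, and the example distributions are made up.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same outcomes."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # terms with zero probability under p contribute nothing
        if q_i == 0.0:
            return math.inf  # p puts mass where q has none
        total += p_i * math.log(p_i / q_i)
    return total


# Hypothetical next-token distributions over a three-token vocabulary.
original_model = [0.70, 0.20, 0.10]
modified_model = [0.68, 0.21, 0.11]

# A small value means the modified model's predictions stay close to the original's.
print(f"KLD = {kl_divergence(original_model, modified_model):.4f}")
```

Under that assumption, a value close to zero indicates the modification changed the model's output distribution very little.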

arxiv arXiv cs.AI · 1d ago

PaperClaw: Autonomous Research with Human-in-the-Loop Refinement

PaperClaw is a multi-agent system that autonomously conducts research from field selection to paper publication. It uses a validated, iterative propose-test-reflect loop, grounded in real references and runnable results, and supports human-in-the-loop refinement at any stage. Evaluation shows it produces strong papers both autonomously and with human oversight.