AI agents — korshunov.ai


Unreliable Feedback Can Harm Tool-Using LLM Agents

The paper shows that misleading feedback can make LLM agents perform worse than they do with no feedback at all. On HotpotQA, Qwen2.5-7B drops from 44.8 to 4.7 F1 under shuffled retrieval, even though the tools themselves are clean. The results suggest that reported tool gains may be overstated and that no-feedback controls are essential for valid evaluation.
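As a rough illustration of the no-feedback control idea, the sketch below scores a toy agent under clean, shuffled, and absent feedback. It is not the paper's harness: `token_f1`, `shuffled_feedback`, `mean_f1`, and the toy agent are hypothetical names, and the F1 here is a simplified token-overlap score without the answer normalization a real HotpotQA evaluation would apply.

```python
import random
from collections import Counter


def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted and a gold answer (simplified)."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def shuffled_feedback(feedback: list, seed: int = 0) -> list:
    """Rotate by a random non-zero offset so each question gets another question's feedback."""
    if len(feedback) < 2:
        return list(feedback)
    offset = random.Random(seed).randrange(1, len(feedback))
    return feedback[offset:] + feedback[:offset]


def mean_f1(answer_fn, questions, golds, feedback) -> float:
    """Average F1 of answer_fn(question, feedback_item) over the dataset."""
    scores = [token_f1(answer_fn(q, fb), g)
              for q, g, fb in zip(questions, golds, feedback)]
    return sum(scores) / len(scores)


# Toy agent (assumption): copies the feedback when present, else falls back to a prior guess.
def trusting_agent(question, fb):
    return fb if fb is not None else "paris"


questions = ["q1", "q2", "q3"]
golds = ["paris", "berlin", "rome"]
clean = list(golds)

for name, fb in [("clean", clean),
                 ("shuffled", shuffled_feedback(clean)),
                 ("none", [None] * len(questions))]:
    print(name, round(mean_f1(trusting_agent, questions, golds, fb), 3))
```

In this toy setup the agent that blindly trusts shuffled feedback scores below its own no-feedback run, which is the kind of gap a no-feedback control is meant to expose.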

arxiv arXiv cs.AI · 2d ago

AutoRAS: Learning Robust Agentic Systems with Primitive Representations

AutoRAS proposes a framework for automatically designing robust agentic systems by generating sequences of symbolic primitives that encode both structural connectivity and behavioral actions. It optimizes these sequences using safety signals from execution and flow-based objectives, achieving superior performance in both normal and adversarial conditions with minimal degradation under attacks.

arxiv arXiv cs.AI · 2d ago

CORTIS: Text-Only Adaptation of Spoken Language Models

CORTIS enables task-oriented voice agents to generate structured speech outputs by fine-tuning spoken language models using only text-form task supervision. It outperforms ASR-LLM cascades under acoustic degradation, especially in preserving high-level task semantics, without requiring paired speech-target annotations during training.

arxiv arXiv cs.AI · 2d ago

Decoupling Declarative and Procedural Knowledge in Vision-Language-Action Models

w²VLA introduces a modular vision-language-action model that decouples declarative and procedural knowledge. By restructuring information flow, it enables robust behavior cloning and zero-shot skill transfer to novel, dissimilar objects.

arxiv arXiv cs.AI · 2d ago

Design-Time Verification of Agentic AI Workflows

A new approach verifies agentic AI workflows during design by modeling them as compositions of reusable building blocks. It applies twelve structural rules to ensure compatibility, reliably detecting design flaws even after structural transformations like task splitting.

arxiv arXiv cs.AI · 2d ago

Zero-shot Procedural Mistake Detection with VLMs

ZeProM, a unified zero-shot framework, uses a pre-trained Video-Language Model to jointly perform procedural mistake detection and temporal action segmentation. It achieves improvements of up to 4.4 points in EDA and 2.0 points in F1@.5 on EgoPER tasks, matching or exceeding supervised methods without task-specific training.

media r/LocalLLaMA · 2d ago

MiniMax 2.7 Runs on 47TG 1200PP with 96GB VRAM

MiniMax 2.7 runs at 47 tokens/s token generation (TG) and 1200 tokens/s prompt processing (PP) on a system with 96GB VRAM and 192GB DDR5 RAM, using an MSI B840 board and a 9900X CPU. It runs as an agent-class model with strong instruction following and tool calling, supported by a round-robin loop with three CPU-based sequencing agents and a dense 12B model that monitors for errors.

lab Claude Code Releases · 2d ago

Claude Code v2.1.187 Release Notes

Claude Code v2.1.187 introduces sandbox credential blocking, org-configured model restrictions, and mouse click support in fullscreen, along with fixes for command failures, tool hangs, and UI stability. The update also improves structured output handling, agent depth tracking, and plugin management, and enhances VSCode and terminal compatibility.

media r/LocalLLaMA · 2d ago

Tmax-27B Terminal Agent for Small GPUs with DPPO Training

Tmax-27B is a terminal agent based on Qwen3.6-27B and trained with DPPO (an RL method), scoring 43% on Terminal Bench 2.0 and 69% on TB Lite. To run on consumer GPUs, it is quantized to importance-matrix-calibrated GGUF files at 2 to 5 bits per weight, with a grafted MTP head that enables speculative decoding. The IQ2_XS variant, at 8.5 GiB, achieves a 70% pass rate on agentic coding tasks, outperforming plain quantization and showing stable tool-call generation.
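The relationship between file size and bits per weight can be checked with a little arithmetic. The sketch below assumes the model has exactly 27e9 parameters (inferred from the "27B" in the name) and that the whole file is weights; in practice metadata, mixed tensor types, and the grafted MTP head all affect the figure, so treat the result as a rough effective value rather than the nominal IQ2_XS rate.

```python
GIB = 2 ** 30  # bytes per GiB


def effective_bits_per_weight(file_size_gib: float, n_params: float) -> float:
    """Average bits stored per parameter for a file of the given size."""
    return file_size_gib * GIB * 8 / n_params


def estimated_size_gib(bits_per_weight: float, n_params: float) -> float:
    """Approximate file size in GiB at a given average bits per weight."""
    return bits_per_weight * n_params / 8 / GIB


N_PARAMS = 27e9  # assumption: taken from the "27B" in the model name

bpw = effective_bits_per_weight(8.5, N_PARAMS)
print(f"8.5 GiB over 27B params: {bpw:.2f} bits per weight")

# Size range implied by the 2 to 5 bits-per-weight span in the summary.
for target in (2, 3, 4, 5):
    print(f"{target} bpw: {estimated_size_gib(target, N_PARAMS):.1f} GiB")
```

Under these assumptions the 8.5 GiB file works out to roughly 2.7 bits per weight.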

lab Anthropic News · 2d ago

Introducing Claude Tag for Slack Teams

Claude Tag lets teams tag @Claude in Slack to delegate tasks, with access to selected channels, tools, and codebases. It learns from channel context, works asynchronously, and proactively updates users on relevant information. Today, 65% of the code from Anthropic’s product team is written by an internal Claude Tag deployment, and it is now available in beta for Claude Enterprise and Team customers.

media r/LocalLLaMA · 2d ago

Reusable workflows for long-running local LLMs

Hayden has developed the knot harness to manage long-running local LLM tasks. It enables reusable workflows with agent profiles, file system event monitoring, and automatic triggers, using Pi.dev as the default agent.

media r/LocalLLaMA · 2d ago

Best local models for reasoning in agentic AI

The creator of EverFern asks which local models work best for agentic workflows and browser/computer use. They note that model intelligence is rarely the bottleneck, with reliability and recovery systems being more critical than model choice.

media r/LocalLLaMA · 2d ago

SFT or RL-first for Qwen 3.5 Tool Agent Training?

A user asks whether supervised fine-tuning (SFT) followed by reinforcement learning (RL) is still recommended for training Qwen 3.5 4B or 9B agents for multi-tool use, or if RL-only approaches yield better results. The post also seeks guidance on reward design and handling parallel tool execution in agent workflows.

arxiv arXiv cs.CL · 2d ago

Group-Graph Policy Optimization for Long-Horizon Agentic RL

Group-Graph Policy Optimization (G2PO) is a graph-based approach to long-horizon agentic reinforcement learning that transforms interaction trajectories into state-transition graphs. The graph enables group-aggregated state-value estimation and edge-centric advantage calculation, which improves credit assignment and reduces variance. G2PO improves success rate by up to 22.2% over GRPO on the WebShop, ALFWorld, and AppWorld benchmarks.
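To make the idea of group-aggregated state values and edge-level advantages concrete, here is a minimal sketch. It is not the paper's algorithm: it assumes states are hashable and merged by equality, rewards arrive only at the end of a trajectory, a state's value is the mean return of every rollout in the group that visits it, and an edge's advantage is a TD-style difference between its endpoint values. The function name and the toy states are hypothetical.

```python
from collections import defaultdict


def edge_advantages(trajectories, gamma: float = 1.0):
    """trajectories: list of (states, final_reward) pairs from one rollout group.

    Returns (value, advantage) where value maps state -> mean discounted return
    and advantage maps (state, next_state) -> gamma * V(next_state) - V(state).
    """
    returns_by_state = defaultdict(list)
    for states, reward in trajectories:
        last = len(states) - 1
        for t, state in enumerate(states):
            # Outcome-only reward, discounted back from the final step.
            returns_by_state[state].append(reward * gamma ** (last - t))

    value = {s: sum(rs) / len(rs) for s, rs in returns_by_state.items()}

    advantage = {}
    for states, _ in trajectories:
        for s, s_next in zip(states, states[1:]):
            advantage[(s, s_next)] = gamma * value[s_next] - value[s]
    return value, advantage


# Toy group of three rollouts sharing the start state "s0".
group = [
    (["s0", "a", "win"], 1.0),
    (["s0", "b", "lose"], 0.0),
    (["s0", "a", "lose"], 0.0),
]
value, advantage = edge_advantages(group)
for edge, adv in sorted(advantage.items()):
    print(edge, round(adv, 3))
```

Because rollouts that share a state pool their returns, the edge from "s0" to "a" gets positive credit even in the rollout that eventually failed, which is the sort of finer-grained credit assignment the summary describes.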

arxiv arXiv cs.CL · 2d ago

PhoneBuddy: Training Open Models for Agentic Phone Use

PhoneBuddy combines real and mock app environments to train open models for phone use. It improves task success rates from 36.67% to 45.33% on real phones and from 60.3% to 83.2% on AndroidWorld, showing that mock-app training complements but does not replace real-app RL.

arxiv arXiv cs.CL · 2d ago

Self-Evolution of Tool-Calling Agents via Divergence-Point Preference Learning

ToolGraph enhances multi-turn tool-using agents by integrating schema topology, transition weights, and history-aware controls. Training with DPO on 161 divergence-point preference pairs improves performance: ToolGraph+DPO achieves a 16.8% relative reward gain over baseline, especially in airline and retail tasks, with reward positivity emerging as the key diagnostic signal.
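One plausible reading of a "divergence-point preference pair" is sketched below: two tool-call trajectories share a prefix, and the first step where they differ becomes the chosen/rejected pair for preference training. This is an assumption for illustration, not the paper's construction; the function names, the dictionary layout, and the airline-style tool names are all hypothetical.

```python
from typing import Optional


def divergence_point(chosen: list, rejected: list) -> Optional[int]:
    """Index of the first step where the two trajectories differ, or None if identical."""
    for i, (a, b) in enumerate(zip(chosen, rejected)):
        if a != b:
            return i
    if len(chosen) != len(rejected):
        return min(len(chosen), len(rejected))
    return None


def preference_pair(chosen: list, rejected: list) -> Optional[dict]:
    """Build a prompt/chosen/rejected record at the divergence point.

    Returns None when the trajectories never diverge or when one is a strict
    prefix of the other (no competing step exists at the divergence index).
    """
    i = divergence_point(chosen, rejected)
    if i is None or i >= len(chosen) or i >= len(rejected):
        return None
    return {
        "prompt": chosen[:i],    # shared history up to the divergence
        "chosen": chosen[i],     # step taken by the higher-reward trajectory
        "rejected": rejected[i], # step taken by the lower-reward trajectory
    }


# Hypothetical tool-call traces for an airline task.
good = ["search_flights", "get_fare_rules", "book_flight"]
bad = ["search_flights", "book_flight"]
print(preference_pair(good, bad))
```

Records in this prompt/chosen/rejected shape are the usual input format for DPO-style training, so only the diverging step is contrasted rather than the whole trajectory.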

arxiv arXiv cs.CL · 2d ago

AFTER Benchmark Evaluates Procedural Memory in LLM Agents

AFTER introduces a benchmark of 382 enterprise tasks across six roles and 22 skills to assess skill transfer across tasks, roles, and models. Results show procedural memory improves performance by 3.7-6.7 points per refinement and achieves 73.1% cross-model accuracy, with some skills generalizing broadly and others specializing to role-specific workflows.

lab Hugging Face Blog · 2d ago

Build Real Agentic Apps with CUGA: 24 Working Examples

CUGA introduces a lightweight harness enabling developers to build real agentic applications. It includes 24 working examples demonstrating practical implementations across various use cases.

arxiv arXiv cs.CL · 2d ago

AgentCIBench Evaluates Privacy Risks in Computer-Use Agents

AgentCIBench introduces a benchmark to assess privacy risks in computer-use agents. It identifies three key failure modes—visual co-location, task-ambiguity overshare, and recipient misalignment—and finds that 11 of 15 evaluated agents leak personal data in over 50% of scenarios, with an average leakage of 67.9%.

arxiv arXiv cs.CL · 2d ago

Tmax: A Simple RL Recipe for Terminal Agents

Tmax presents the strongest open RL recipe for terminal agents, achieving 27% on Terminal-Bench 2.0 with only 9B parameters. It uses a novel data taxonomy to generate over 2.5x more terminal environments than prior datasets, enabling efficient training with a simple, outcome-only recipe. The dataset, models, and code are open-sourced at https://github.com/hamishivi/tmax.