AI agents — korshunov.ai

Design-Time Verification of Agentic AI Workflows

A new approach verifies agentic AI workflows during design by modeling them as compositions of reusable building blocks. It applies twelve structural rules to ensure compatibility, reliably detecting design flaws even after structural transformations like task splitting.

arxiv arXiv cs.AI · 1d ago

Zero-shot Procedural Mistake Detection with VLMs

ZeProM is a unified zero-shot framework that uses a pre-trained Video-Language Model to perform procedural mistake detection and temporal action segmentation jointly. On EgoPER tasks it improves EDA by up to 4.4 points and F1@.5 by up to 2.0 points, matching or exceeding supervised methods without task-specific training.

media r/LocalLLaMA · 1d ago

MiniMax 2.7 Runs on 47TG 1200PP with 96GB VRAM

MiniMax 2.7 runs on a system with 96GB VRAM and 192GB DDR5 RAM, an MSI B840 board, and a 9900X CPU. The "47TG 1200PP" in the title most likely refers to token-generation and prompt-processing throughput in tokens per second, not to a parameter count. The poster describes it as an agent-class model with strong instruction following and tool calling, run in a round-robin loop with three CPU-based sequencing agents and a dense 12B model that monitors for errors.

lab Claude Code Releases · 1d ago

Claude Code v2.1.187 Release Notes

Claude Code v2.1.187 adds sandbox credentials blocking, org-configured model restrictions, and mouse click support in fullscreen, and fixes command failures, tool hangs, and UI stability issues. The release also improves structured output handling, agent depth tracking, and plugin management, along with VSCode and terminal compatibility.

media r/LocalLLaMA · 1d ago

Tmax-27B Terminal Agent for Small GPUs with DPPO Training

Tmax-27B is a terminal agent based on Qwen3.6-27B and trained with DPPO (RL), scoring 43% on Terminal Bench 2.0 and 69% on TB Lite. To run on consumer GPUs, it is quantized to importance-matrix-calibrated GGUF models at 2 to 5 bits per weight, with a grafted MTP head enabling speculative decoding. The IQ2_XS variant, at 8.5 GiB, achieves a 70% pass rate on agentic coding tasks, outperforming plain quantization and generating tool calls stably.

lab Anthropic News · 2d ago

Introducing Claude Tag for Slack Teams

Claude Tag lets teams tag @Claude in Slack to delegate tasks, with access to selected channels, tools, and codebases. It learns from channel context, works asynchronously, and proactively updates users on relevant information. According to Anthropic, 65% of its product team’s code is now created by its internal Claude Tag deployment. The feature is available in beta for Claude Enterprise and Team customers.

media r/LocalLLaMA · 2d ago

Reusable workflows for long-running local LLMs

Hayden has developed the knot harness to manage long-running local LLM tasks. It enables reusable workflows with agent profiles, file system event monitoring, and automatic triggers, using Pi.dev as the default agent.

media r/LocalLLaMA · 2d ago

Best local models for reasoning in agentic AI

The creator of EverFern asks which local models work best for agentic workflows and browser/computer use. They note that model intelligence is rarely the bottleneck, with reliability and recovery systems being more critical than model choice.

media r/LocalLLaMA · 2d ago

SFT or RL-first for Qwen 3.5 Tool Agent Training?

A user asks whether supervised fine-tuning (SFT) followed by reinforcement learning (RL) is still recommended for training Qwen 3.5 4B or 9B agents for multi-tool use, or if RL-only approaches yield better results. The post also seeks guidance on reward design and handling parallel tool execution in agent workflows.

arxiv arXiv cs.CL · 2d ago

Group-Graph Policy Optimization for Long-Horizon Agentic RL

Group-Graph Policy Optimization (G2PO) takes a graph-based approach to long-horizon agentic reinforcement learning: it transforms interaction trajectories into state-transition graphs. This enables group-aggregated state-value estimation and edge-centric advantage calculation, which improve credit assignment and reduce variance. G2PO improves success rate by up to 22.2% over GRPO on the WebShop, ALFWorld, and AppWorld benchmarks.
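
The idea in the G2PO summary above can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes hashable states, a plain Monte Carlo mean as the group-aggregated state value, and a one-step TD form for the edge advantage. The function name `edge_advantages` and the `(state, action, reward, next_state)` tuple layout are hypothetical.

```python
from collections import defaultdict


def edge_advantages(trajectories, gamma=0.99):
    """Merge a group of trajectories into a state-transition graph.

    trajectories: list of lists of (state, action, reward, next_state).
    Returns (value, advantage), keyed by state and by (state, action, next_state).
    Assumed formulation for illustration only, not the published algorithm.
    """
    # 1. Collect the discounted return-to-go of every visit to every state.
    returns_by_state = defaultdict(list)
    for traj in trajectories:
        g = 0.0
        for state, _action, reward, _next_state in reversed(traj):
            g = reward + gamma * g
            returns_by_state[state].append(g)

    # 2. Group-aggregated state value: mean return over all visits in the group.
    value = {s: sum(rs) / len(rs) for s, rs in returns_by_state.items()}

    # 3. Edge-centric advantage: one-step TD error on each edge,
    #    averaged when the same edge appears more than once.
    td_by_edge = defaultdict(list)
    for traj in trajectories:
        for state, action, reward, next_state in traj:
            # States with no outgoing edge (terminals) are valued at 0.
            v_next = value.get(next_state, 0.0)
            td = reward + gamma * v_next - value[state]
            td_by_edge[(state, action, next_state)].append(td)
    advantage = {e: sum(ts) / len(ts) for e, ts in td_by_edge.items()}
    return value, advantage


# Two rollouts that share the start state "s0": one succeeds, one fails.
group = [
    [("s0", "a", 0.0, "s1"), ("s1", "b", 1.0, "T")],
    [("s0", "c", 0.0, "s2"), ("s2", "d", 0.0, "T")],
]
value, advantage = edge_advantages(group, gamma=1.0)
print(value["s0"])                   # 0.5
print(advantage[("s0", "a", "s1")])  # 0.5
print(advantage[("s0", "c", "s2")])  # -0.5
```

Because both rollouts pass through `s0`, its value is estimated from the whole group, and the credit lands on the edge where the two rollouts diverge rather than being spread evenly over every step.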

arxiv arXiv cs.CL · 2d ago

PhoneBuddy: Training Open Models for Agentic Phone Use

PhoneBuddy combines real and mock app environments to train open models for phone use. It improves task success rates from 36.67% to 45.33% on real phones and from 60.3% to 83.2% on AndroidWorld, showing mock-app training complements but does not replace real-app RL.

arxiv arXiv cs.CL · 2d ago

Self-Evolution of Tool-Calling Agents via Divergence-Point Preference Learning

ToolGraph enhances multi-turn tool-using agents by integrating schema topology, transition weights, and history-aware controls. Training with DPO on 161 divergence-point preference pairs improves performance: ToolGraph+DPO achieves a 16.8% relative reward gain over baseline, especially in airline and retail tasks, with reward positivity emerging as the key diagnostic signal.

arxiv arXiv cs.CL · 2d ago

AFTER Benchmark Evaluates Procedural Memory in LLM Agents

AFTER introduces a benchmark of 382 enterprise tasks across six roles and 22 skills to assess skill transfer across tasks, roles, and models. Results show procedural memory improves performance by 3.7-6.7 points per refinement and achieves 73.1% cross-model accuracy, with some skills generalizing broadly and others specializing to role-specific workflows.

lab Hugging Face Blog · 2d ago

Build Real Agentic Apps with CUGA: 24 Working Examples

CUGA introduces a lightweight harness enabling developers to build real agentic applications. It includes 24 working examples demonstrating practical implementations across various use cases.

arxiv arXiv cs.CL · 2d ago

AgentCIBench Evaluates Privacy Risks in Computer-Use Agents

AgentCIBench introduces a benchmark to assess privacy risks in computer-use agents. It identifies three key failure modes—visual co-location, task-ambiguity overshare, and recipient misalignment—and finds that 11 of 15 evaluated agents leak personal data in over 50% of scenarios, with an average leakage of 67.9%.

arxiv arXiv cs.CL · 2d ago

Tmax: A Simple RL Recipe for Terminal Agents

Tmax presents the strongest open RL recipe for terminal agents, achieving 27% on Terminal-Bench 2.0 with only 9B parameters. It uses a novel data taxonomy to generate over 2.5x more terminal environments than prior datasets, enabling efficient training with a simple, outcome-only recipe. The dataset, models, and code are open-sourced at https://github.com/hamishivi/tmax.

arxiv arXiv cs.CL · 2d ago

SelfCompact: Self-Driving Context Compaction for Language Models

SelfCompact enables language models to autonomously decide when and how to compact accumulated context during reasoning. By combining a model-invoked summarization tool with a lightweight rubric that guides compaction based on trajectory structure, it achieves effective adaptive compaction without fine-tuning. Results show it matches or exceeds fixed-interval methods on math and agentic search benchmarks, improving baselines by up to 18.1 points on math and 5-9 points on search, at 30-70% lower token cost.

arxiv arXiv cs.CL · 2d ago

EnterpriseClawBench: Real-World Agent Benchmark Released

EnterpriseClawBench is a benchmark built from real workplace sessions, featuring 852 reproducible tasks with detailed metadata. The best configuration, Codex with GPT-5.5, scores only 0.663, highlighting the need for multi-dimensional evaluation of enterprise agents.

media r/LocalLLaMA · 2d ago

Is Sakana Fugu Just an IQ Experiment?

A Reddit post asks whether Sakana Fugu is merely an orchestration wrapper rather than a genuine AI model, and suggests that misleading implications may lead people to see it as a "mythos 5 killer". The post raises concerns that users will misinterpret its capabilities.

arxiv arXiv cs.CL · 2d ago

OpenBioRQ: Benchmark for Agentic Biomedical Research Faithfulness

OpenBioRQ introduces a benchmark of 12,553 unsolved biomedical research questions across 12 domains, designed to test agentic models' faithfulness and abstention. It evaluates models in a tool-using setting without answer keys, judging them against real follow-up evidence rather than parametric knowledge. The results reveal a marked agentic collapse on the hardest questions, where models stop using tools even though tools are critical there.