AI agents — korshunov.ai


Introducing Claude Tag for Slack Teams

Claude Tag lets teams tag @Claude in Slack to delegate tasks, giving it access to selected channels, tools, and codebases. It learns from channel context, works asynchronously, and proactively updates users with relevant information. Today, 65% of the code produced by Anthropic’s product team is written by its internal Claude Tag, and the product is now available in beta for Claude Enterprise and Team customers.

media r/LocalLLaMA · 2d ago

Reusable workflows for long-running local LLMs

Hayden has developed the knot harness to manage long-running local LLM tasks. It enables reusable workflows with agent profiles, file system event monitoring, and automatic triggers, using Pi.dev as the default agent.

media r/LocalLLaMA · 2d ago

Best local models for reasoning in agentic AI

The creator of EverFern asks which local models work best for agentic workflows and browser/computer use. They note that model intelligence is rarely the bottleneck, with reliability and recovery systems being more critical than model choice.

media r/LocalLLaMA · 2d ago

SFT or RL-first for Qwen 3.5 Tool Agent Training?

A user asks whether supervised fine-tuning (SFT) followed by reinforcement learning (RL) is still recommended for training Qwen 3.5 4B or 9B agents for multi-tool use, or if RL-only approaches yield better results. The post also seeks guidance on reward design and handling parallel tool execution in agent workflows.

arxiv arXiv cs.CL · 2d ago

Group-Graph Policy Optimization for Long-Horizon Agentic RL

Group-Graph Policy Optimization (G2PO) introduces a graph-based approach to enhance long-horizon agentic reinforcement learning by transforming interaction trajectories into state-transition graphs. It enables group-aggregated state-value estimation and edge-centric advantage calculation, improving credit assignment and reducing variance, and achieves up to 22.2% success rate improvement over GRPO on WebShop, ALFWorld, and AppWorld benchmarks.
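
To make the graph idea concrete, here is a toy sketch in Python. It is not the paper's algorithm. It assumes that states are hashable labels, that a state's value is the mean final return of the group trajectories passing through it, and that an edge's advantage is the value difference between its endpoints. The function names and the example group are hypothetical.

```python
from collections import defaultdict

def state_values(trajectories):
    """Average the final return of every trajectory that visits each state.

    Assumption: identical state labels across trajectories are merged
    into a single graph node.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for states, ret in trajectories:
        for s in set(states):  # count each trajectory once per state
            totals[s] += ret
            counts[s] += 1
    return {s: totals[s] / counts[s] for s in totals}

def edge_advantages(trajectories):
    """Score each distinct transition (u, v) as V(v) - V(u)."""
    values = state_values(trajectories)
    edges = set()
    for states, _ in trajectories:
        edges.update(zip(states, states[1:]))
    return {(u, v): values[v] - values[u] for u, v in edges}

# Hypothetical group of three rollouts: (visited states, final return).
group = [
    (["start", "search", "item_a", "buy"], 1.0),
    (["start", "search", "item_b", "quit"], 0.0),
    (["start", "home", "quit"], 0.0),
]
adv = edge_advantages(group)
for edge, a in sorted(adv.items()):
    print(edge, round(a, 3))
```

In this toy group, the transition from "search" to "item_a" receives positive credit and the one from "search" to "item_b" receives negative credit, even though both rollouts share the same prefix. That is the kind of step-level credit assignment a trajectory-level baseline cannot express.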

arxiv arXiv cs.CL · 2d ago

PhoneBuddy: Training Open Models for Agentic Phone Use

PhoneBuddy combines real and mock app environments to train open models for phone use. It improves task success rates from 36.67% to 45.33% on real phones and from 60.3% to 83.2% on AndroidWorld, showing mock-app training complements but does not replace real-app RL.

arxiv arXiv cs.CL · 2d ago

Self-Evolution of Tool-Calling Agents via Divergence-Point Preference Learning

ToolGraph enhances multi-turn tool-using agents by integrating schema topology, transition weights, and history-aware controls. Training with DPO on 161 divergence-point preference pairs improves performance: ToolGraph+DPO achieves a 16.8% relative reward gain over baseline, especially in airline and retail tasks, with reward positivity emerging as the key diagnostic signal.

arxiv arXiv cs.CL · 2d ago

AFTER Benchmark Evaluates Procedural Memory in LLM Agents

AFTER introduces a benchmark of 382 enterprise tasks across six roles and 22 skills to assess skill transfer across tasks, roles, and models. Results show procedural memory improves performance by 3.7-6.7 points per refinement and achieves 73.1% cross-model accuracy, with some skills generalizing broadly and others specializing to role-specific workflows.

lab Hugging Face Blog · 2d ago

Build Real Agentic Apps with CUGA: 24 Working Examples

CUGA introduces a lightweight harness enabling developers to build real agentic applications. It includes 24 working examples demonstrating practical implementations across various use cases.

arxiv arXiv cs.CL · 2d ago

AgentCIBench Evaluates Privacy Risks in Computer-Use Agents

AgentCIBench introduces a benchmark to assess privacy risks in computer-use agents. It identifies three key failure modes—visual co-location, task-ambiguity overshare, and recipient misalignment—and finds that 11 of 15 evaluated agents leak personal data in over 50% of scenarios, with an average leakage of 67.9%.

arxiv arXiv cs.CL · 2d ago

Tmax: A Simple RL Recipe for Terminal Agents

Tmax presents the strongest open RL recipe for terminal agents, achieving 27% on Terminal-Bench 2.0 with only 9B parameters. It uses a novel data taxonomy to generate over 2.5x more terminal environments than prior datasets, enabling efficient training with a simple, outcome-only recipe. The dataset, models, and code are open-sourced at https://github.com/hamishivi/tmax.

arxiv arXiv cs.CL · 2d ago

SelfCompact: Self-Driving Context Compaction for Language Models

SelfCompact enables language models to autonomously decide when and how to compact accumulated context during reasoning. By combining a model-invoked summarization tool with a lightweight rubric that guides compaction based on trajectory structure, it achieves effective adaptive compaction without fine-tuning. Results show it matches or exceeds fixed-interval methods on math and agentic search benchmarks, improving baselines by up to 18.1 points on math and 5-9 points on search, at 30-70% lower token cost.

arxiv arXiv cs.CL · 2d ago

EnterpriseClawBench: Real-World Agent Benchmark Released

EnterpriseClawBench is a benchmark built from real workplace sessions, featuring 852 reproducible tasks with detailed metadata. The best configuration achieves only 0.663 (Codex with GPT-5.5), highlighting the need for multi-dimensional evaluation of enterprise agents.

media r/LocalLLaMA · 2d ago

Is Sakana Fugu Just an IQ Experiment?

A Reddit post questions whether Sakana Fugu is an orchestration wrapper rather than a genuine AI model, and suggests that misleading framing may lead users to see it as a "mythos 5 killer" and misinterpret its capabilities.

arxiv arXiv cs.CL · 3d ago

OpenBioRQ: Benchmark for Agentic Biomedical Research Faithfulness

OpenBioRQ introduces a benchmark of 12,553 unsolved biomedical research questions across 12 domains, designed to test agentic models' faithfulness and abstention. It evaluates models in a tool-using setting without answer keys, using real follow-up evidence rather than parametric knowledge, and reveals significant agentic collapse on the hardest questions, where models stop using tools even though tools are critical.

arxiv arXiv cs.CL · 3d ago

Moshi-Face: Full-Duplex Dialogue with Facial Generation

Moshi-Face is the first full-duplex spoken dialogue model that jointly processes audio and facial input, generating both speech and synchronized facial motion. It uses a VQ-VAE face codec to encode 3D head meshes from facial videos into discrete face tokens and to reconstruct meshes from those tokens, and a Face Transformer module to generate the tokens non-autoregressively for real-time audiovisual output. Experiments show Moshi-Face achieves audiovisual alignment with low latency while maintaining original dialogue quality.

arxiv arXiv cs.CL · 3d ago

CFAgentBench: Benchmark for Autonomous Construction-Finance Agents

CFAgentBench introduces a reproducible, self-hostable environment with 1,014 machine-gradeable tasks across eight domains, grounded in real-world sources. It features 40 oracle-validated tasks with executable evaluators that assess functional correctness via state diffs and output regexes, including a money-movement guard requiring human approval for payments. A key finding is that top agents lose 43% of successes when repeating tasks under temperature-0 decoding, indicating single-attempt performance does not reflect real-world deployability.
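
A minimal sketch of what such an evaluator could look like, assuming the environment state is a flat dictionary and the agent's output is a string. The function names, state keys, and regex are hypothetical and are not taken from CFAgentBench itself.

```python
import re

def state_diff(before, after):
    """Return {key: (old, new)} for every key whose value changed."""
    keys = before.keys() | after.keys()
    return {
        k: (before.get(k), after.get(k))
        for k in keys
        if before.get(k) != after.get(k)
    }

def evaluate(before, after, output, expected_diff, output_pattern):
    """Pass only if the state changed exactly as expected AND the output matches."""
    diff_ok = state_diff(before, after) == expected_diff
    output_ok = re.search(output_pattern, output) is not None
    return diff_ok and output_ok

def guarded_payment(ledger, amount, approved_by_human):
    """Hypothetical money-movement guard: refuse to pay without human approval."""
    if not approved_by_human:
        raise PermissionError("payment requires human approval")
    updated = dict(ledger)
    updated["paid"] = updated.get("paid", 0) + amount
    return updated

# Hypothetical task: the agent should approve an invoice without moving money.
before = {"invoice_status": "open", "paid": 0}
after = {"invoice_status": "approved", "paid": 0}
passed = evaluate(
    before,
    after,
    output="Invoice INV-17 approved",
    expected_diff={"invoice_status": ("open", "approved")},
    output_pattern=r"INV-\d+ approved",
)

# The guard blocks an unapproved payment attempt.
try:
    guarded_payment(after, 250, approved_by_human=False)
    blocked = False
except PermissionError:
    blocked = True

print(passed, blocked)
```

Checking the exact state diff, rather than only the final value of one field, also catches unintended side effects, such as a payment made while approving an invoice.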

arxiv arXiv cs.CL · 3d ago

Nous: A Predictive World Model for Long-Term Agent Memory

Nous introduces a memory architecture based on prediction rather than storage, using categorical probability distributions to model world knowledge. Evaluated on LoCoMo with GPT-4o-mini, it achieves F1 scores of 63.50 (single-hop), 55.32 (multi-hop), 58.57 (temporal), and 62.50 (open-domain), outperforming A-MEM in three categories and BeliefMem in all, though evaluation differences limit full comparability.

arxiv arXiv cs.CL · 3d ago

Measuring Genuine Emergent Consensus in LLM Agent Societies

A new metric, coupling gain gamma, measures how agents adjust opinions when neighbors' views are perturbed. It reveals that frontier LLMs do not spontaneously polarize, and a diagnostic of final versus initial opinion shows that claimed emergent consensus in prior work involves model artifacts. Valid consensus emerges only when group-level, modality-matched coupling is considered, not single-neighbour interactions.

lab OpenAI News · 3d ago

Omio builds AI-native conversational travel

Omio leverages OpenAI to enhance conversational travel experiences. The company uses AI to accelerate product development and transition into an AI-native business model.