AI agents — korshunov.ai


SFT or RL-first for Qwen 3.5 Tool Agent Training?

A user asks whether supervised fine-tuning (SFT) followed by reinforcement learning (RL) is still the recommended way to train Qwen 3.5 4B or 9B agents for multi-tool use, or whether RL-only approaches yield better results. The post also seeks guidance on reward design and on handling parallel tool execution in agent workflows.
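As a rough illustration of the parallel tool execution the post asks about, here is a minimal sketch, assuming tools are plain async Python callables and that the calls in one batch are independent of each other. The names `call_tool`, `run_parallel`, and the `tools` registry are hypothetical and not taken from the post or from any Qwen tooling.

```python
import asyncio

async def call_tool(name, args, tools):
    """Run one tool call and capture failures instead of raising (hypothetical helper)."""
    try:
        result = await tools[name](**args)
        return {"tool": name, "ok": True, "result": result}
    except Exception as exc:
        # Errors are returned as data so one failing tool does not cancel the batch.
        return {"tool": name, "ok": False, "error": str(exc)}

async def run_parallel(calls, tools):
    """Dispatch independent tool calls concurrently; results keep the order of `calls`."""
    return await asyncio.gather(*(call_tool(n, a, tools) for n, a in calls))
```

Returning errors as ordinary results keeps every call visible to the policy, which can matter if a reward is computed per tool call; whether that suits a particular reward design is a judgment call.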

arxiv arXiv cs.CL · 2d ago

Group-Graph Policy Optimization for Long-Horizon Agentic RL

Group-Graph Policy Optimization (G2PO) introduces a graph-based approach to enhance long-horizon agentic reinforcement learning by transforming interaction trajectories into state-transition graphs. It enables group-aggregated state-value estimation and edge-centric advantage calculation, improving credit assignment and reducing variance, and achieves up to 22.2% success rate improvement over GRPO on WebShop, ALFWorld, and AppWorld benchmarks.
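To make the idea of group-aggregated state values and edge-centric advantages more concrete, here is a minimal sketch. It is not the paper's formulation: it assumes states are hashable identifiers, each trajectory gets a single outcome reward, a state's value is the mean discounted return over every visit in the group, and an edge's advantage is the value difference across that edge. The function name `edge_advantages` is hypothetical.

```python
from collections import defaultdict

def edge_advantages(trajectories, gamma=1.0):
    """trajectories: list of (states, final_reward); states is a list of hashable ids.

    Assumed simplification, not the G2PO paper's exact estimator.
    """
    returns = defaultdict(list)
    for states, reward in trajectories:
        last = len(states) - 1
        for t, s in enumerate(states):
            # Discounted outcome reward as seen from step t.
            returns[s].append(gamma ** (last - t) * reward)
    # Group aggregation: trajectories that share a state share its value estimate.
    value = {s: sum(r) / len(r) for s, r in returns.items()}
    adv = {}
    for states, _ in trajectories:
        for s, s_next in zip(states, states[1:]):
            # Edge-centric advantage: how much the transition changes expected return.
            adv[(s, s_next)] = gamma * value[s_next] - value[s]
    return value, adv
```

With two rollouts that share a start state and then diverge, the shared state's value is the group mean, so credit concentrates on the edge where the rollouts split.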

arxiv arXiv cs.CL · 2d ago

PhoneBuddy: Training Open Models for Agentic Phone Use

PhoneBuddy combines real and mock app environments to train open models for phone use. It improves task success rates from 36.67% to 45.33% on real phones and from 60.3% to 83.2% on AndroidWorld, showing mock-app training complements but does not replace real-app RL.

arxiv arXiv cs.CL · 2d ago

Self-Evolution of Tool-Calling Agents via Divergence-Point Preference Learning

ToolGraph enhances multi-turn tool-using agents by integrating schema topology, transition weights, and history-aware controls. Training with DPO on 161 divergence-point preference pairs improves performance: ToolGraph+DPO achieves a 16.8% relative reward gain over baseline, especially in airline and retail tasks, with reward positivity emerging as the key diagnostic signal.

arxiv arXiv cs.CL · 2d ago

AFTER Benchmark Evaluates Procedural Memory in LLM Agents

AFTER introduces a benchmark of 382 enterprise tasks across six roles and 22 skills to assess skill transfer across tasks, roles, and models. Results show procedural memory improves performance by 3.7-6.7 points per refinement and achieves 73.1% cross-model accuracy, with some skills generalizing broadly and others specializing to role-specific workflows.

lab Hugging Face Blog · 2d ago

Build Real Agentic Apps with CUGA: 24 Working Examples

CUGA introduces a lightweight harness enabling developers to build real agentic applications. It includes 24 working examples demonstrating practical implementations across various use cases.

arxiv arXiv cs.CL · 2d ago

AgentCIBench Evaluates Privacy Risks in Computer-Use Agents

AgentCIBench introduces a benchmark to assess privacy risks in computer-use agents. It identifies three key failure modes—visual co-location, task-ambiguity overshare, and recipient misalignment—and finds that 11 of 15 evaluated agents leak personal data in over 50% of scenarios, with an average leakage of 67.9%.

arxiv arXiv cs.CL · 2d ago

Tmax: A Simple RL Recipe for Terminal Agents

Tmax presents the strongest open RL recipe for terminal agents, achieving 27% on Terminal-Bench 2.0 with only 9B parameters. It uses a novel data taxonomy to generate over 2.5x more terminal environments than prior datasets, enabling efficient training with a simple, outcome-only recipe. The dataset, models, and code are open-sourced at https://github.com/hamishivi/tmax.

arxiv arXiv cs.CL · 2d ago

SelfCompact: Self-Driving Context Compaction for Language Models

SelfCompact enables language models to autonomously decide when and how to compact accumulated context during reasoning. By combining a model-invoked summarization tool with a lightweight rubric that guides compaction based on trajectory structure, it achieves effective adaptive compaction without fine-tuning. Results show it matches or exceeds fixed-interval methods on math and agentic search benchmarks, improving baselines by up to 18.1 points on math and 5-9 points on search, at 30-70% lower token cost.

arxiv arXiv cs.CL · 2d ago

EnterpriseClawBench: Real-World Agent Benchmark Released

EnterpriseClawBench is a benchmark built from real workplace sessions, featuring 852 reproducible tasks with detailed metadata. The best configuration, Codex with GPT-5.5, scores only 0.663, highlighting the need for multi-dimensional evaluation of enterprise agents.

media r/LocalLLaMA · 2d ago

Is Sakana Fugu Just an IQ Experiment?

A Reddit post questions whether Sakana Fugu is merely an orchestration wrapper rather than a genuine AI model, and suggests that misleading framing may lead people to see it as a "Mythos 5 killer". The post raises concerns about users misinterpreting its capabilities.

arxiv arXiv cs.CL · 2d ago

OpenBioRQ: Benchmark for Agentic Biomedical Research Faithfulness

OpenBioRQ introduces a benchmark of 12,553 unsolved biomedical research questions across 12 domains, designed to test agentic models' faithfulness and abstention. It evaluates models in a tool-using setting without answer keys, using real follow-up evidence rather than parametric knowledge, and reveals significant agentic collapse on the hardest questions where tools are no longer used despite being critical.

arxiv arXiv cs.CL · 2d ago

Moshi-Face: Full-Duplex Dialogue with Facial Generation

Moshi-Face is the first full-duplex spoken dialogue model that jointly processes audio and facial input, generating both speech and synchronized facial motion. It uses a VQ-VAE face codec to encode and reconstruct 3D head meshes from facial videos into discrete face tokens, and a Face Transformer module to generate these tokens non-autoregressively for real-time audiovisual output. Experiments show Moshi-Face achieves audiovisual alignment with low latency while maintaining original dialogue quality.

arxiv arXiv cs.CL · 2d ago

CFAgentBench: Benchmark for Autonomous Construction-Finance Agents

CFAgentBench introduces a reproducible, self-hostable environment with 1,014 machine-gradeable tasks across eight domains, grounded in real-world sources. It features 40 oracle-validated tasks with executable evaluators that assess functional correctness via state diffs and output regexes, including a money-movement guard requiring human approval for payments. A key finding is that top agents lose 43% of successes when repeating tasks under temperature-0 decoding, indicating single-attempt performance does not reflect real-world deployability.
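An evaluator that combines a state diff, an output regex, and a payment guard could look roughly like the sketch below. It assumes environment state is a flat dictionary and that actions are dictionaries with a `type` field. The names `state_diff`, `evaluate`, and `guard_payment` are hypothetical and are not CFAgentBench's actual API.

```python
import re

def state_diff(before, after):
    """Return {key: (old, new)} for every key whose value changed."""
    keys = set(before) | set(after)
    return {k: (before.get(k), after.get(k))
            for k in keys if before.get(k) != after.get(k)}

def evaluate(before, after, output, expected_diff, output_pattern):
    """Pass only if the state changed exactly as expected and the output matches."""
    diff_ok = state_diff(before, after) == expected_diff
    regex_ok = re.search(output_pattern, output) is not None
    return diff_ok and regex_ok

def guard_payment(action, approved_by_human):
    """Block money movement unless a human has approved it (assumed policy)."""
    if action.get("type") == "payment" and not approved_by_human:
        raise PermissionError("payment requires human approval")
    return action
```

Comparing the full diff rather than single fields also catches unintended side effects, such as an agent changing a balance it was never asked to touch.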

arxiv arXiv cs.CL · 2d ago

Nous: A Predictive World Model for Long-Term Agent Memory

Nous introduces a memory architecture based on prediction rather than storage, using categorical probability distributions to model world knowledge. Evaluated on LoCoMo with GPT-4o-mini, it achieves F1 scores of 63.50 (single-hop), 55.32 (multi-hop), 58.57 (temporal), and 62.50 (open-domain), outperforming A-MEM in three categories and BeliefMem in all, though evaluation differences limit full comparability.

arxiv arXiv cs.CL · 2d ago

Measuring Genuine Emergent Consensus in LLM Agent Societies

A new metric, coupling gain gamma, measures how agents adjust their opinions when neighbors' views are perturbed. It reveals that frontier LLMs do not spontaneously polarize, and a diagnostic comparing final with initial opinions shows that the emergent consensus claimed in prior work involves model artifacts. Valid consensus emerges only when group-level, modality-matched coupling is considered, not single-neighbor interactions.

lab OpenAI News · 2d ago

Omio builds AI-native conversational travel

Omio leverages OpenAI to enhance conversational travel experiences. The company uses AI to accelerate product development and transition into an AI-native business model.

arxiv arXiv cs.CL · 3d ago

PlanBench-XL: Benchmark for Long-Horizon Tool-Use Planning

PlanBench-XL introduces a benchmark of 327 retail tasks across 1,665 tools to evaluate LLM agents' ability to iteratively retrieve and use tools in long-horizon planning. It includes a blocking mechanism simulating tool failures, revealing that agents like GPT-5.4 drop from 51.90% to 11.36% accuracy under severe disruptions, highlighting vulnerabilities in recovery and adaptability.

arxiv arXiv cs.CL · 3d ago

VADAOrchestra: Neurosymbolic Orchestration of Adaptive Reasoning Workflows

VADAOrchestra introduces a neurosymbolic framework that combines LLM-based workflow orchestration with Datalog+/- symbolic reasoning. It enables adaptive, explainable decision-making by incrementally planning workflows and executing logical inference on demand, offering auditability, scalability, and verifiability in real-world financial scenarios.

arxiv arXiv cs.CL · 3d ago

MacAgentBench Launches macOS AI Agent Benchmark

MacAgentBench introduces a comprehensive benchmark with 676 tasks across 25 applications, 60% of which involve both GUI and CLI interactions. It uses deterministic rule-based evaluation and fine-grained multi-checkpoint scoring, revealing that Claude Opus 4.6 on OpenClaw achieves 73.7% Pass@1, primarily due to its skill library rather than framework design.