AI agents — korshunov.ai


EnterpriseClawBench: Real-World Agent Benchmark Released

EnterpriseClawBench is a benchmark built from real workplace sessions, featuring 852 reproducible tasks with detailed metadata. Even the best configuration, Codex with GPT-5.5, scores only 0.663, highlighting the need for multi-dimensional evaluation of enterprise agents.

media r/LocalLLaMA · 2d ago

Is Sakana Fugu Just an IQ Experiment?

A Reddit post questions whether Sakana Fugu is merely an orchestration wrapper rather than a genuine AI model, and suggests that misleading framing may lead people to see it as a "Mythos 5 killer". The post raises concerns that users will misinterpret its capabilities.

arxiv arXiv cs.CL · 2d ago

OpenBioRQ: Benchmark for Agentic Biomedical Research Faithfulness

OpenBioRQ introduces a benchmark of 12,553 unsolved biomedical research questions across 12 domains, designed to test agentic models' faithfulness and abstention. It evaluates models in a tool-using setting without answer keys, using real follow-up evidence rather than parametric knowledge. It reveals a significant "agentic collapse" on the hardest questions, where models stop using tools even though tools are critical.

arxiv arXiv cs.CL · 2d ago

Moshi-Face: Full-Duplex Dialogue with Facial Generation

Moshi-Face is the first full-duplex spoken dialogue model that jointly processes audio and facial input, generating both speech and synchronized facial motion. It uses a VQ-VAE face codec to encode and reconstruct 3D head meshes from facial videos into discrete face tokens, and a Face Transformer module to generate these tokens non-autoregressively for real-time audiovisual output. Experiments show Moshi-Face achieves audiovisual alignment with low latency while maintaining original dialogue quality.

arxiv arXiv cs.CL · 2d ago

CFAgentBench: Benchmark for Autonomous Construction-Finance Agents

CFAgentBench introduces a reproducible, self-hostable environment with 1,014 machine-gradeable tasks across eight domains, grounded in real-world sources. It features 40 oracle-validated tasks with executable evaluators that assess functional correctness via state diffs and output regexes, including a money-movement guard requiring human approval for payments. A key finding is that top agents lose 43% of successes when repeating tasks under temperature-0 decoding, indicating single-attempt performance does not reflect real-world deployability.
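As a rough illustration of the evaluator style described above (state diffs, output regexes, and a payment approval guard), here is a minimal Python sketch. It is not CFAgentBench's code: the function names, the dictionary-based state, and the `payment_keys` and `approved` parameters are all assumptions made for the example.

```python
import re

def state_diff(before: dict, after: dict) -> dict:
    """Return {key: (old, new)} for every key whose value changed."""
    keys = before.keys() | after.keys()
    return {k: (before.get(k), after.get(k))
            for k in keys if before.get(k) != after.get(k)}

def evaluate(before, after, output, expected_diff, output_pattern,
             payment_keys=(), approved=False):
    """Hypothetical evaluator: pass only if the state changed exactly as
    expected, the output matches the regex, and any change to a
    payment-related key was approved by a human."""
    diff = state_diff(before, after)
    # Money-movement guard (assumed form): unapproved payment changes fail.
    if any(k in diff for k in payment_keys) and not approved:
        return False
    return diff == expected_diff and re.search(output_pattern, output) is not None

before = {"invoice_42": "open", "balance": 1000}
after = {"invoice_42": "paid", "balance": 900}
expected = {"invoice_42": ("open", "paid"), "balance": (1000, 900)}
ok = evaluate(before, after, "Paid invoice 42", expected,
              r"[Pp]aid invoice \d+", payment_keys=("balance",), approved=True)
```

With `approved=False` the same call fails, because the guard rejects the balance change before correctness is even checked.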

arxiv arXiv cs.CL · 2d ago

Nous: A Predictive World Model for Long-Term Agent Memory

Nous introduces a memory architecture based on prediction rather than storage, using categorical probability distributions to model world knowledge. Evaluated on LoCoMo with GPT-4o-mini, it achieves F1 scores of 63.50 (single-hop), 55.32 (multi-hop), 58.57 (temporal), and 62.50 (open-domain), outperforming A-MEM in three categories and BeliefMem in all, though evaluation differences limit full comparability.

arxiv arXiv cs.CL · 2d ago

Measuring Genuine Emergent Consensus in LLM Agent Societies

A new metric, coupling gain gamma, measures how agents adjust opinions when neighbors' views are perturbed. It reveals that frontier LLMs do not spontaneously polarize, and a diagnostic of final versus initial opinion shows that claimed emergent consensus in prior work involves model artifacts. Valid consensus emerges only when group-level, modality-matched coupling is considered, not single-neighbor interactions.

lab OpenAI News · 2d ago

Omio builds AI-native conversational travel

Omio leverages OpenAI to enhance conversational travel experiences. The company uses AI to accelerate product development and transition into an AI-native business model.

arxiv arXiv cs.CL · 2d ago

PlanBench-XL: Benchmark for Long-Horizon Tool-Use Planning

PlanBench-XL introduces a benchmark of 327 retail tasks across 1,665 tools to evaluate LLM agents' ability to iteratively retrieve and use tools in long-horizon planning. It includes a blocking mechanism simulating tool failures, revealing that agents like GPT-5.4 drop from 51.90% to 11.36% accuracy under severe disruptions, highlighting vulnerabilities in recovery and adaptability.
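One way to picture a blocking mechanism like the one described above is a wrapper that makes selected tools fail, so the agent has to recover with an alternative. The Python sketch below is only an illustration under that assumption. `ToolBlocked`, `with_blocking`, `call_with_fallback`, and the tool names are hypothetical and are not taken from PlanBench-XL.

```python
class ToolBlocked(Exception):
    """Raised when a simulated tool failure is triggered."""

def with_blocking(tools: dict, blocked: set) -> dict:
    """Wrap each tool so that calls to blocked tools raise ToolBlocked."""
    def make(name, fn):
        def wrapped(*args, **kwargs):
            if name in blocked:
                raise ToolBlocked(name)
            return fn(*args, **kwargs)
        return wrapped
    return {name: make(name, fn) for name, fn in tools.items()}

def call_with_fallback(tools: dict, candidates: list, *args):
    """Try candidate tools in order and return (tool_name, result)."""
    for name in candidates:
        try:
            return name, tools[name](*args)
        except ToolBlocked:
            continue  # recover by moving on to the next candidate
    raise RuntimeError("all candidate tools are blocked")

# Hypothetical retail tools: two ways to look up a price.
tools = with_blocking(
    {"price_lookup_v1": lambda sku: 10, "price_lookup_v2": lambda sku: 10},
    blocked={"price_lookup_v1"},
)
used, price = call_with_fallback(tools, ["price_lookup_v1", "price_lookup_v2"], "sku-1")
```

An agent that never retries with a second candidate would fail this task outright, which is the kind of recovery gap the reported accuracy drop points to.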

arxiv arXiv cs.CL · 2d ago

VADAOrchestra: Neurosymbolic Orchestration of Adaptive Reasoning Workflows

VADAOrchestra introduces a neurosymbolic framework that combines LLM-based workflow orchestration with Datalog+/- symbolic reasoning. It enables adaptive, explainable decision-making by incrementally planning workflows and executing logical inference on demand, offering auditability, scalability, and verifiability in real-world financial scenarios.

arxiv arXiv cs.CL · 2d ago

MacAgentBench Launches macOS AI Agent Benchmark

MacAgentBench introduces a comprehensive benchmark with 676 tasks across 25 applications, 60% of which involve both GUI and CLI interactions. It uses deterministic rule-based evaluation and fine-grained multi-checkpoint scoring, revealing that Claude Opus 4.6 on OpenClaw achieves 73.7% Pass@1, primarily due to its skill library rather than framework design.
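For readers unfamiliar with the two scoring ideas mentioned above, the following Python sketch shows one plausible reading: a per-task checkpoint score is the fraction of rule-based checks passed, and Pass@1 is the share of tasks whose first attempt passes every check. The function names and this exact definition are assumptions for illustration, not MacAgentBench's implementation.

```python
def checkpoint_score(results: list[bool]) -> float:
    """Fraction of a task's checkpoints that passed (0.0 if there are none)."""
    return sum(results) / len(results) if results else 0.0

def pass_at_1(first_attempts: list[list[bool]]) -> float:
    """Share of tasks whose first attempt passed all of its checkpoints."""
    passed = sum(1 for checks in first_attempts if checks and all(checks))
    return passed / len(first_attempts)

# Hypothetical checkpoint results for the first attempt at three tasks.
attempts = [
    [True, True, True],    # fully solved
    [True, False, True],   # partial credit only
    [True, True],          # fully solved
]
partial = checkpoint_score(attempts[1])
overall = pass_at_1(attempts)
```

The partial score keeps credit for the checks that passed, while Pass@1 counts that task as a failure.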

media r/LocalLLaMA · 2d ago

MCP servers consume context window via tool definitions

Each MCP server dumps its full tool list into the model's context before any prompt, using up to 24,000 tokens for 62 tools. A local gateway implementing lazy discovery reduces tool-definition overhead by 97%, cutting token usage from ~24k to ~660 per request, with 90% fewer total tokens over a task, without affecting task success rate.
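The idea of lazy discovery can be sketched as follows: instead of placing every tool definition in the context up front, the gateway exposes a single search tool and returns full definitions only for matches. This Python sketch is an assumption about the general approach, not the gateway from the post. `make_catalog`, `eager_context`, `lazy_context`, and `find_tools` are hypothetical names, and character counts stand in for token counts.

```python
import json

SCHEMA = {"type": "object", "properties": {"arg": {"type": "string"}}}

def make_catalog(n: int) -> list:
    """Build n hypothetical tool definitions."""
    return [{"name": f"tool_{i}",
             "description": f"hypothetical tool number {i}",
             "input_schema": SCHEMA} for i in range(n)]

def eager_context(catalog: list) -> str:
    """Eager mode: every tool definition is serialized into the context."""
    return json.dumps(catalog)

def lazy_context() -> str:
    """Lazy mode: only one meta-tool is advertised up front."""
    return json.dumps([{"name": "find_tools",
                        "description": "Search available tools by keyword",
                        "input_schema": SCHEMA}])

def find_tools(catalog: list, keyword: str, limit: int = 3) -> list:
    """Return full definitions only for tools matching the keyword."""
    hits = [t for t in catalog
            if keyword in t["name"] or keyword in t["description"]]
    return hits[:limit]

catalog = make_catalog(62)            # 62 tools, the count quoted in the post
upfront_saving = 1 - len(lazy_context()) / len(eager_context(catalog))
quoted_reduction = 1 - 660 / 24_000   # figures from the post, about 0.97
```

The quoted figures are self-consistent: going from roughly 24,000 to roughly 660 tokens is a reduction of about 97%.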

arxiv arXiv cs.CL · 2d ago

LRE: Few-Kilobytes Agent Memory with Zero Neural Cost

LRE is a CPU-only, language-model-free system that learns which interaction history units are load-bearing. It outperforms baselines in accuracy-cost balance, reducing peak context size by up to 52% and improving task completion by 37% in some cases. LRE achieves superior answer quality with 68% fewer tokens and requires no annotations or neural computation for training.

arxiv arXiv cs.CL · 2d ago

Beaver: Agent Harness for Scientific Curation from Multimodal Sources

Beaver is an agent harness that extracts structured information from scientific papers by integrating multimodal evidence tooling, task scaffolding, and artifact-grounded autoresearch. It achieves 81.0 on the Gold-Referenced Attribute Score, outperforming frontier agents by over 23 points, with key gains on high-value attributes requiring cross-modal reasoning.

arxiv arXiv cs.CL · 2d ago

AdaMem: Learning What to Remember for Personalized Long-Horizon LLM Agents

AdaMem learns what to remember for each user from feedback, reducing memory bloat and improving QA accuracy by up to 9.0% over uniform baselines while shrinking memory volume by 9%.

arxiv arXiv cs.CL · 2d ago

Dementia-Agents: Multi-Modal Multi-Agent System for Dementia Staging

Dementia-Agents introduces a clinically aligned multi-agent framework for real-world dementia staging and phenotyping. It improves diagnostic performance over monolithic models and prior systems, while maintaining domain-level interpretability, using data from 1,066 patients across two cognitive neurology services.

arxiv arXiv cs.CL · 2d ago

ARCO: Adaptive Rubric with Co-Evolution for Multi-Step LLM Agents

ARCO introduces a rubric framework that enables step-level credit assignment for multi-step LLM agents. It jointly updates a shared model with generation and scoring heads, allowing the rubric content and scoring function to co-evolve via on-policy data, improving performance and interpretability across benchmarks.

media r/LocalLLaMA · 2d ago

Microsoft Releases Open Source FastContext for LLM Coding Agents

Microsoft has open-sourced FastContext-1.0, a lightweight repository-exploration subagent that separates code repository exploration from task solving in LLM coding agents. It uses parallel read-only tool calls to return compact file paths and line ranges, improving end-to-end accuracy and reducing token usage by up to 60.3%, with the 4B-RL model outperforming a 30B-SFT model on SWE-bench Pro.

media Latent Space · 3d ago

AI Red Teaming and Prompt Injection Risks Explained

Zico Kolter and Matt Fredrikson, co-authors of the definitive paper on indirect prompt injections and authorities on the Mythos model, discuss the growing risks of AI security. They highlight that AI systems require a distinct security mindset, with agents introducing new vulnerabilities, and that specialized red-teaming AI can outperform humans in breaking models, making AI prompt injection breaches increasingly likely.

lab Claude Code Releases · 3d ago

Claude Code v2.1.186 Release Notes

Claude Code v2.1.186 adds CLI authentication commands for MCP servers, status filtering in workflows, and a "Skills" section in plugin settings. It includes numerous bug fixes for UI, session management, and agent behavior, along with improvements to YAML parsing, memory handling, and tool validation.