AI agents — korshunov.ai

Nous: A Predictive World Model for Long-Term Agent Memory

Nous introduces a memory architecture based on prediction rather than storage, using categorical probability distributions to model world knowledge. Evaluated on LoCoMo with GPT-4o-mini, it achieves F1 scores of 63.50 (single-hop), 55.32 (multi-hop), 58.57 (temporal), and 62.50 (open-domain), outperforming A-MEM in three categories and BeliefMem in all four, though differences in evaluation setup limit direct comparability.

arxiv arXiv cs.CL · 2d ago

Measuring Genuine Emergent Consensus in LLM Agent Societies

A new metric, coupling gain gamma, measures how agents adjust opinions when neighbors' views are perturbed. It reveals that frontier LLMs do not spontaneously polarize, and a diagnostic of final versus initial opinion shows that claimed emergent consensus in prior work involves model artifacts. Valid consensus emerges only when group-level, modality-matched coupling is considered, not single-neighbour interactions.

lab OpenAI News · 2d ago

Omio builds AI-native conversational travel

Omio leverages OpenAI to enhance conversational travel experiences. The company uses AI to accelerate product development and transition into an AI-native business model.

arxiv arXiv cs.CL · 2d ago

PlanBench-XL: Benchmark for Long-Horizon Tool-Use Planning

PlanBench-XL introduces a benchmark of 327 retail tasks across 1,665 tools to evaluate LLM agents' ability to iteratively retrieve and use tools in long-horizon planning. It includes a blocking mechanism simulating tool failures, revealing that agents like GPT-5.4 drop from 51.90% to 11.36% accuracy under severe disruptions, highlighting vulnerabilities in recovery and adaptability.

arxiv arXiv cs.CL · 2d ago

VADAOrchestra: Neurosymbolic Orchestration of Adaptive Reasoning Workflows

VADAOrchestra introduces a neurosymbolic framework that combines LLM-based workflow orchestration with Datalog+/- symbolic reasoning. It enables adaptive, explainable decision-making by incrementally planning workflows and executing logical inference on demand, offering auditability, scalability, and verifiability in real-world financial scenarios.

arxiv arXiv cs.CL · 2d ago

MacAgentBench Launches macOS AI Agent Benchmark

MacAgentBench introduces a comprehensive benchmark with 676 tasks across 25 applications, 60% of which involve both GUI and CLI interactions. It uses deterministic rule-based evaluation and fine-grained multi-checkpoint scoring, revealing that Claude Opus 4.6 on OpenClaw achieves 73.7% Pass@1, primarily due to its skill library rather than framework design.

media r/LocalLLaMA · 2d ago

MCP servers consume context window via tool definitions

Each MCP server dumps its full tool list into the model's context before any prompt, using up to 24,000 tokens for 62 tools. A local gateway implementing lazy discovery reduces tool-definition overhead by 97%, cutting token usage from ~24k to ~660 per request, with 90% fewer total tokens over a task, without affecting task success rate.
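
A minimal sketch of the lazy-discovery idea, assuming a gateway that advertises two small meta-tools and serves full definitions only on request. The `LazyGateway` class, its method names, and the placeholder tools are hypothetical illustrations, not the post's implementation or any MCP SDK API; only the 62-tool count and the ~24k and ~660 token figures come from the summary above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    description: str
    schema: str  # JSON schema text, usually the bulkiest part of a definition


class LazyGateway:
    """Hypothetical gateway that advertises meta-tools instead of every definition."""

    META_TOOLS = ("search_tools(query)", "describe_tool(name)")

    def __init__(self, tools):
        self._tools = {t.name: t for t in tools}

    def eager_context(self):
        # What lands in the context when every definition is loaded up front.
        return "\n".join(
            f"{t.name}: {t.description} {t.schema}" for t in self._tools.values()
        )

    def lazy_context(self):
        # What the gateway sends instead: only the meta-tool signatures.
        return "\n".join(self.META_TOOLS)

    def search_tools(self, query):
        # Return matching names only; full definitions are fetched one at a time.
        q = query.lower()
        return sorted(
            name
            for name, t in self._tools.items()
            if q in name.lower() or q in t.description.lower()
        )

    def describe_tool(self, name):
        t = self._tools[name]
        return f"{t.name}: {t.description} {t.schema}"


def reduction(before, after):
    # Fractional saving when a cost drops from `before` to `after`.
    return 1 - after / before


# 62 placeholder tools, matching the count quoted in the post.
tools = [
    Tool(f"tool_{i}", f"Does thing {i}", '{"type": "object", "properties": {}}')
    for i in range(62)
]
gateway = LazyGateway(tools)
```

With the post's figures, `reduction(24000, 660)` works out to 0.9725, in line with the quoted 97%.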

arxiv arXiv cs.CL · 2d ago

LRE: Few-Kilobytes Agent Memory with Zero Neural Cost

LRE is a CPU-only, language-model-free system that learns which interaction history units are load-bearing. It outperforms baselines in accuracy-cost balance, reducing peak context size by up to 52% and improving task completion by 37% in some cases. LRE achieves superior answer quality with 68% fewer tokens and requires no annotations or neural computation for training.

arxiv arXiv cs.CL · 2d ago

Beaver: Agent Harness for Scientific Curation from Multimodal Sources

Beaver is an agent harness that extracts structured information from scientific papers by integrating multimodal evidence tooling, task scaffolding, and artifact-grounded autoresearch. It achieves 81.0 on the Gold-Referenced Attribute Score, outperforming frontier agents by over 23 points, with key gains on high-value attributes requiring cross-modal reasoning.

arxiv arXiv cs.CL · 2d ago

AdaMem: Learning What to Remember for Personalized Long-Horizon LLM Agents

AdaMem learns what to remember for each user from feedback, reducing memory bloat and improving QA accuracy by up to 9.0% over uniform baselines while shrinking memory volume by 9%.

arxiv arXiv cs.CL · 2d ago

Dementia-Agents: Multi-Modal Multi-Agent System for Dementia Staging

Dementia-Agents introduces a clinically aligned multi-agent framework for real-world dementia staging and phenotyping. It improves diagnostic performance over monolithic models and prior systems, while maintaining domain-level interpretability, using data from 1,066 patients across two cognitive neurology services.

arxiv arXiv cs.CL · 2d ago

ARCO: Adaptive Rubric with Co-Evolution for Multi-Step LLM Agents

ARCO introduces a rubric framework that enables step-level credit assignment for multi-step LLM agents. It jointly updates a shared model with generation and scoring heads, allowing the rubric content and scoring function to co-evolve via on-policy data, improving performance and interpretability across benchmarks.

media r/LocalLLaMA · 2d ago

Microsoft Releases Open Source FastContext for LLM Coding Agents

Microsoft has open-sourced FastContext-1.0, a lightweight repository-exploration subagent that separates code repository exploration from task solving in LLM coding agents. It uses parallel read-only tool calls to return compact file paths and line ranges, improving end-to-end accuracy and reducing token usage by up to 60.3%, with the 4B-RL model outperforming a 30B-SFT model on SWE-bench Pro.
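
A rough sketch of the pattern described above: read-only searches fanned out in parallel, with only file paths and line ranges handed back to the solver. The `find_ranges` and `explore` helpers and the in-memory `repo` dict are assumptions made for illustration; FastContext's actual tools and interfaces are not shown in the summary.

```python
from concurrent.futures import ThreadPoolExecutor


def find_ranges(path, text, needle, pad=1):
    # Read-only search: return (path, start_line, end_line) around each hit.
    lines = text.splitlines()
    hits = [i for i, line in enumerate(lines, start=1) if needle in line]
    return [(path, max(1, i - pad), min(len(lines), i + pad)) for i in hits]


def explore(repo, needle, workers=4):
    # Fan the searches out across files; map() keeps results in input order.
    items = sorted(repo.items())
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_file = pool.map(lambda item: find_ranges(item[0], item[1], needle), items)
        return [hit for hits in per_file for hit in hits]


# Hypothetical two-file repository held in memory.
repo = {
    "a.py": "import os\ndef load():\n    pass\n",
    "b.py": "x = 1\n",
}
ranges = explore(repo, "def load")
```

The solver then receives a few compact tuples rather than whole files, which is where the token savings in such a design would come from.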

media Latent Space · 2d ago

AI Red Teaming and Prompt Injection Risks Explained

Zico Kolter and Matt Fredrikson, co-authors of the definitive paper on indirect prompt injections and authorities on the Mythos model, discuss the growing risks of AI security. They highlight that AI systems require a distinct security mindset, with agents introducing new vulnerabilities, and that specialized red-teaming AI can outperform humans in breaking models, making AI prompt injection breaches increasingly likely.

lab Claude Code Releases · 2d ago

Claude Code v2.1.186 Release Notes

Claude Code v2.1.186 adds CLI authentication commands for MCP servers, status filtering in workflows, and a "Skills" section in plugin settings. It also includes numerous bug fixes for the UI, session management, and agent behavior, along with improvements to YAML parsing, memory handling, and tool validation.

media MarkTechPost · 2d ago

Sakana AI Launches Sakana Fugu: Multi-Agent Orchestration Model

Sakana AI has launched Sakana Fugu, an orchestration model that routes tasks across a swappable pool of frontier LLMs via a single OpenAI-compatible API. Fugu Ultra outperforms individual models on key benchmarks like SWE-bench Pro and GPQA-D, and the system demonstrates superior performance on complex, multi-step tasks such as auto-research, Rubik's Cube solving, and blindfold chess.

media r/LocalLLaMA · 3d ago

TMax: A Simple Recipe for Terminal Agents

TMax introduces TMax-15k, a dataset of 14,600 RL environments, over 2.5× larger than the next-largest open terminal dataset. It also presents a simple RL recipe that trains open models from 2B to 27B parameters, with TMax-9B achieving 27.2% on Terminal Bench 2.0 and TMax-27B reaching 42.7%.

media r/LocalLLaMA · 3d ago

Same model, same prompt, 4 different agents produce varied code quality

A self-hosted Qwen3.6-27B model with identical prompt and hardware generated four different HTML/JavaScript solar system simulations. The agent scaffolding significantly influenced output: opencode produced clean, stable code with accurate physics; pi showed robustness and coordinate consistency; hermes offered visually appealing but physically flawed results; qwen code generated minimal, crude code. The results highlight how agent design shapes code quality, correctness, and stability despite shared model and prompt.

media Interconnects · 3d ago

GLM-5.2 is the step change for open agents

GLM-5.2, an open-weight AI model released by Z.ai, has set a new benchmark in coding and general agent performance. It outperforms models like Claude Fable 5 and Gemini, and matches or exceeds Anthropic's Opus 4.8 in max thinking mode, establishing itself as the first open model that feels right in coding harnesses as a general agent.

media r/LocalLLaMA · 3d ago

I Built a Tool to Stop Manually Swapping Models on My 8GB GPU

I developed Prompt-Chain, a Streamlit app that chains a small Prompter model with a large Coder model into a single pipeline. It automatically unloads one model from VRAM and loads the other when moving from prompt refinement to code generation, eliminating manual model switching and reducing tokens wasted on poorly worded prompts.
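
A toy sketch of the two-stage chain, assuming only one model fits in VRAM at a time. Loading and unloading are simulated with a list and a context manager; the `loaded` and `chain` names and the stand-in model behaviour are hypothetical and not taken from the Prompt-Chain code.

```python
from contextlib import contextmanager

resident = []  # names of models currently "in VRAM" (simulated)
log = []       # load/unload events, in order


@contextmanager
def loaded(name):
    # Refuse to load a second model while another is still resident.
    if resident:
        raise RuntimeError(f"{resident[0]} is still loaded")
    resident.append(name)
    log.append(f"load {name}")
    try:
        # Stand-in for a real model call: tag the prompt with the model name.
        yield lambda prompt: f"[{name}] {prompt}"
    finally:
        resident.remove(name)
        log.append(f"unload {name}")


def chain(raw_prompt):
    # Stage 1: the small model refines the prompt, then is unloaded.
    with loaded("prompter") as refine:
        refined = refine(raw_prompt)
    # Stage 2: the large model generates code from the refined prompt.
    with loaded("coder") as generate:
        return generate(refined)


result = chain("sort a list")
```

Each stage releases its model before the next one loads, so peak memory is bounded by the larger of the two models rather than their sum.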