AI agents — korshunov.ai

MEMPROBE: Benchmark for Long-Term Memory Recovery in Agents

MEMPROBE is a benchmark that evaluates long-term memory in AI agents by reconstructing a user's hidden state from the agent's memory after interaction. It tests 5 memory systems across 50 simulated users with 31 dimensions each, finding that task completion is high even for memoryless agents, while memory recovery remains moderate and drops under top-k retrieval. MEMPROBE enables direct, auditable assessment of memory retention and proposes recovery as a key objective for future agent development.

arxiv arXiv cs.CL · 1d ago

Qwen-AgentWorld: Language World Models for General Agents

Qwen-AgentWorld-35B-A3B and Qwen-AgentWorld-397B-A17B are presented as the first language world models that simulate agentic environments across seven domains using long chain-of-thought reasoning. Trained via a three-stage pipeline of continued pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning (RL), these models outperform existing frontier models on AgentWorldBench, a benchmark derived from real-world interactions of five models on nine established tasks.

arxiv arXiv cs.LG · 1d ago

Distilling Transformers into Recurrent Transformers for Efficient Memory

A new distillation method transfers the observation compression strategy of full-history transformers to recurrent models. By training a teacher model to compress observation histories into fixed-size bottlenecks, the approach aligns the student's memory with the teacher's compression. This enables recurrent transformers to achieve near-full-history performance with linear-time complexity, making them viable for long-horizon robotics applications.

github CrewAI · 1d ago

CrewAI 1.14.8a3 Release Notes

CrewAI 1.14.8a3 introduces unified declarative flow loading and improved startup UX for crew runs. It consolidates crewai run and flow kickoff commands, adds declarative Flow CLI support, and enables @router() as a flow start method with typed output schemas for tools.

arxiv arXiv cs.AI · 1d ago

FleetAgent: Efficient Teleoperation for Autonomous Fleets

FleetAgent is a cloud-hosted multimodal large language model that processes compact vectorized vehicle-to-network messages to enable efficient, explainable teleoperation. Compared to raw images or text, it reduces uplink payload by up to 625 times and KV-cache memory by 625 times. On the VecEval dataset, it outperforms Qwen2.5-VL-7B on both Lingo-Judge and intervention failure rate.

arxiv arXiv cs.AI · 1d ago

ARCO: Adaptive Rubric with Co-Evolution for Multi-Step LLM Agents

ARCO introduces a rubric framework that enables step-level credit assignment for multi-step LLM agents. It jointly updates a shared model with generation and scoring heads, allowing the rubric content and scoring function to co-evolve via on-policy data, improving performance and interpretability across benchmarks.

arxiv arXiv cs.AI · 1d ago

Social World Model for Lifelong Social Intelligence

The Social World Model decomposes social interaction into five dimensions to enable closed-loop learning. It allows open-source models to sustainably improve and retain social capabilities, outperforming baselines and matching closed-source Gemini 3 Flash in key metrics without forgetting across difficulty levels.

arxiv arXiv cs.AI · 1d ago

DataClaw0: Agentic Tailoring of Multimodal Data from Raw Streams

DataClaw0 introduces an agentic paradigm for actively refining multimodal data to align with user and downstream intents. It uses a two-stage pipeline with factual anchors to generate a large-scale dataset across five domains and achieves strong alignment via supervised fine-tuning and GRPO. Evaluated on video generation, VQA, and GUI navigation, DataClaw0 produces high-information-density data, enabling efficient model adaptation with minimal training data.

arxiv arXiv cs.AI · 1d ago

LLM-Agent Oversight Must Shift from Calibration to Action-Conditioned Control

Current oversight of LLM agents relies on scalar risk scores, but this fails to capture whether an intervention improves outcomes. The paper introduces "intervention advantage" as the key metric, showing that action-conditioned control outperforms scalar routing across benchmarks, with significant regret reduction in interactive regimes. Calibration alone does not resolve the underlying mismatch in control performance.

arxiv arXiv cs.AI · 1d ago

SwarmX: Agentic Scheduling for Low-Latency Systems

SwarmX introduces neural predictors to enable prompt-aware scheduling in agentic AI systems. It reduces tail latency by up to 61.5% and maintains up to 2x the throughput of production schedulers under the same service level objectives.

arxiv arXiv cs.AI · 1d ago

Unreliable Feedback Can Harm Tool-Using LLM Agents

The paper shows that misleading feedback can make tool-using LLM agents perform worse than they do with no feedback at all. On HotpotQA, Qwen2.5-7B drops from 44.8 to 4.7 F1 under shuffled retrieval, despite clean tools. These results indicate that reported tool gains may be overstated and that no-feedback controls are essential for valid evaluation.

arxiv arXiv cs.AI · 1d ago

AutoRAS: Learning Robust Agentic Systems with Primitive Representations

AutoRAS proposes a framework for automatically designing robust agentic systems by generating sequences of symbolic primitives that encode both structural connectivity and behavioral actions. It optimizes these sequences using safety signals from execution and flow-based objectives, achieving superior performance in both normal and adversarial conditions with minimal degradation under attacks.

arxiv arXiv cs.AI · 1d ago

CORTIS: Text-Only Adaptation of Spoken Language Models

CORTIS enables task-oriented voice agents to generate structured speech outputs by fine-tuning spoken language models using only text-form task supervision. It outperforms ASR-LLM cascades under acoustic degradation, especially in preserving high-level task semantics, without requiring paired speech-target annotations during training.

arxiv arXiv cs.AI · 1d ago

Decoupling Declarative and Procedural Knowledge in Vision-Language-Action Models

w²VLA introduces a modular vision-language-action model that decouples declarative and procedural knowledge. By restructuring information flow, it enables robust behavior cloning and zero-shot skill transfer to novel, dissimilar objects.

arxiv arXiv cs.AI · 1d ago

Design-Time Verification of Agentic AI Workflows

A new approach verifies agentic AI workflows during design by modeling them as compositions of reusable building blocks. It applies twelve structural rules to ensure compatibility, reliably detecting design flaws even after structural transformations like task splitting.

arxiv arXiv cs.AI · 1d ago

Zero-shot Procedural Mistake Detection with VLMs

A unified zero-shot framework, ZeProM, uses a pre-trained Video-Language Model to jointly perform procedural mistake detection and temporal action segmentation. It achieves improvements of up to 4.4 points in EDA and 2.0 points in F1@.5 on EgoPER tasks, matching or exceeding supervised methods without task-specific training.

media r/LocalLLaMA · 1d ago

MiniMax 2.7 Runs on 47TG 1200PP with 96GB VRAM

MiniMax 2.7 runs at 47 tokens/s token generation (TG) and 1200 tokens/s prompt processing (PP) on a system with 96GB VRAM and 192GB DDR5 RAM, using an MSI B840 board and a 9900X CPU. It runs as an agent-class model with strong instruction following and tool calling, supported by a round-robin loop with three CPU-based sequencing agents and a dense 12B model that monitors for errors.

lab Claude Code Releases · 1d ago

Claude Code v2.1.187 Release Notes

Claude Code v2.1.187 introduces sandbox credentials blocking, org-configured model restrictions, and mouse click support in fullscreen, along with fixes for command failures, tool hangs, and UI stability. Updates also improve structured output handling, agent depth tracking, and plugin management, with enhancements to VSCode and terminal compatibility.

media r/LocalLLaMA · 1d ago

Tmax-27B Terminal Agent for Small GPUs with DPPO Training

Tmax-27B is a terminal agent based on Qwen3.6-27B and trained with DPPO (RL), achieving 43% on Terminal Bench 2.0 and 69% on TB Lite. To run on consumer GPUs, it is quantized into importance-matrix-calibrated GGUF models at 2 to 5 bits per weight, with a grafted MTP head enabling speculative decoding. The IQ2_XS variant, at 8.5 GiB, achieves a 70% pass rate on agentic coding tasks, outperforming plain quantization and demonstrating stable tool-call generation.

lab Anthropic News · 1d ago

Introducing Claude Tag for Slack Teams

Claude Tag allows teams to tag @Claude in Slack to delegate tasks, with access to selected channels, tools, and codebases. It learns from channel context, works asynchronously, and takes initiative by proactively updating users on relevant information. Today, 65% of Anthropic’s product team code is created by an internal deployment of Claude Tag, and the feature is now available in beta for Claude Enterprise and Team customers.