AI agents — korshunov.ai


Qwen Releases Qwen-AgentWorld-397B-A17B Model

Qwen has announced Qwen-AgentWorld-397B-A17B, a new large language model. It is listed on Hugging Face and covered on Qwen's official blog, which indicates that it has been publicly released and is available for use.

media r/LocalLLaMA · 1d ago

GitHub Repository: Qwen-AgentWorld for Language World Models

Qwen-AgentWorld is a GitHub repository introducing language world models designed for general-purpose agents. The project aims to give agents a broader, more realistic understanding of the world through language-based modeling.

media r/LocalLLaMA · 1d ago

Qwen releases 35B-parameter MoE for agent environment simulation

Qwen has launched Qwen-AgentWorld-35B-A3B, a 35B-parameter MoE model with only about 3B active parameters per token. It is trained to simulate responses from MCP, terminal, software engineering, Android, web, and OS GUI environments by predicting next observations after agent actions, enabling efficient agent training and environment simulation without real tool execution.
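The idea of predicting the next observation instead of executing a real tool can be sketched as a thin environment wrapper. This is a minimal sketch under stated assumptions: `SimulatedEnv`, its `predict` callable, and `toy_terminal_model` are hypothetical names used for illustration only, and the toy function stands in for a world model call; none of this reflects the actual Qwen-AgentWorld interface.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# One (action, observation) pair per step of the interaction.
History = List[Tuple[str, str]]


@dataclass
class SimulatedEnv:
    """Environment whose observations come from a predictor, not a real tool.

    `predict` is a hypothetical stand-in for a world model: given the
    interaction history and the agent's next action, it returns the
    observation the real environment would plausibly have produced.
    """
    predict: Callable[[History, str], str]
    history: History = field(default_factory=list)

    def step(self, action: str) -> str:
        # No command is executed here; the observation is predicted.
        observation = self.predict(self.history, action)
        self.history.append((action, observation))
        return observation


def toy_terminal_model(history: History, action: str) -> str:
    """Toy predictor with canned terminal outputs (illustration only)."""
    canned = {"pwd": "/workspace", "ls": "README.md"}
    return canned.get(action, f"command not found: {action}")


env = SimulatedEnv(predict=toy_terminal_model)
print(env.step("pwd"))
```

In a real setup the predictor would be a model call conditioned on the full history, which is why the wrapper passes `history` along even though the toy function ignores it.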

arxiv arXiv cs.CL · 1d ago

Are We Ready For An Agent-Native Memory System?

A new study decomposes agent memory into four core modules and evaluates 12 systems across five benchmark workloads. It finds no single architecture dominates, with performance dependent on alignment with workload bottlenecks, and reveals that localized maintenance is more cost-efficient than global reorganization.

arxiv arXiv cs.CL · 1d ago

Micro-Transaction Markets for Verified Product Info in Agentic E-Commerce

The bottleneck for autonomous agents in e-commerce is a scarcity of trustworthy product information, not product matching. A proposed micro-transaction model lets agents pay fractions of a cent to access verified data such as service histories and test reports, with pricing and trust determined by reputation scores. The system prioritizes genuine product quality and real-time information acquisition over chatbot fluency.

arxiv arXiv cs.CL · 1d ago

SHERLOC: Structured Diagnostic Localization for Code Repair Agents

SHERLOC introduces a training-free framework that pairs a reasoning LLM with compact repository tools and self-recovery. It achieves state-of-the-art localization accuracy and recall on SWE-Bench, improving repair agents' resolve rate by 5.95 percentage points while reducing localization token usage by 36.7% and total token usage by 23.1%.

arxiv arXiv cs.CL · 1d ago

Metis: Bridging Text and Code Memory for Self-Evolving Agents

Metis introduces a hierarchical dual-representation memory that combines text and code memory to improve self-evolving agents. It organizes experience into execution plans, facts, and pitfalls, crystallizing reusable plans into validated tools only when justified. Evaluated on AppWorld, Metis achieves up to 20.6% higher task accuracy and 22.8% lower execution cost than ReAct, with better overall balance across accuracy, efficiency, and memory cost.

arxiv arXiv cs.CL · 1d ago

MedBench v5: Dynamic Benchmark for Clinical AI

MedBench v5 introduces a dynamic, process-oriented benchmark for clinical multimodal models, featuring clinical cognitive responsiveness and atomic skills across 63 tasks. It includes stressors for degradation analysis and monitors hallucination propagation through five reasoning nodes, revealing that strong task performance does not ensure process stability.

arxiv arXiv cs.CL · 1d ago

Agon: Autonomous Research System via Prompt Economy

Agon is an autonomous research system that uses prompt economy to validate checkable claims in workflows, leaving judgment to human scientists. It operates across 444 iterations with minimal prompts and no human-written code, revealing a taxonomy of failures by severity, fixability, visibility, and capability locus. The system demonstrates scalability and advances research toward a paradigm where machines handle scale and humans guide judgment.

arxiv arXiv cs.CL · 1d ago

Dialogue to Discovery: Attribute-Aware Preference Elicitation

Dialogue to Discovery (D2D) is an attribute-oriented framework that improves conversational product search by dynamically guiding user interactions. It adapts query priorities and recommendation timing, achieving 22.2-29.9% higher target-finding accuracy, 6.6-16.1% lower abandonment, and 27.5% shorter conversations compared to existing methods, with user studies confirming improved satisfaction and efficiency.

arxiv arXiv cs.CL · 1d ago

EDV Framework Enables Reliable Experience Learning for Agentic Systems

The EDV framework introduces an Execute-Distill-Verify paradigm to overcome the self-confirmation trap in large language model agents. By using multiple agents to explore tasks, a third-party agent to distill experiences, and a consensus-based verification step, EDV ensures only accurate experiences are stored in memory. Evaluation on tau2-bench, Mind2Web, and MMTB shows EDV outperforms strong baselines, demonstrating its effectiveness in enabling robust agent self-evolution.
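The Execute-Distill-Verify loop described above can be sketched as a gate in front of the memory store. This is a minimal sketch under stated assumptions: every name here (`edv_step`, `explorers`, `distill`, `verifiers`) is hypothetical, the agents are plain callables, and a simple majority vote stands in for the consensus step, whose actual rule is not given in the summary.

```python
from typing import Callable, List, Optional

Explorer = Callable[[str], str]                         # task -> trajectory
Distiller = Callable[[str, List[str]], Optional[str]]   # task, trajectories -> experience
Verifier = Callable[[str], bool]                        # experience -> accept?


def edv_step(
    task: str,
    explorers: List[Explorer],
    distill: Distiller,
    verifiers: List[Verifier],
    memory: List[str],
) -> bool:
    """Run one Execute-Distill-Verify round; return True if memory was updated."""
    # Execute: several agents attempt the task independently.
    trajectories = [explore(task) for explore in explorers]

    # Distill: a separate agent summarizes the attempts into one experience,
    # so the agent that acted is not the one judging its own run.
    experience = distill(task, trajectories)
    if experience is None:
        return False

    # Verify: store the experience only if a strict majority accepts it
    # (assumed consensus rule, for illustration).
    votes = sum(1 for verify in verifiers if verify(experience))
    if 2 * votes > len(verifiers):
        memory.append(experience)
        return True
    return False
```

Keeping the write to `memory` behind the vote is the point of the sketch: an experience that only its own author believes never reaches storage.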

arxiv arXiv cs.CL · 1d ago

AGORA: Benchmark for Agentic Workplace Document Reasoning

AGORA introduces a benchmark with 362 questions and 9,664 authentic workplace documents totaling 372M tokens, exceeding any model's context window. It evaluates agents' ability to explore documents deliberately, reconcile inconsistencies, and reason across domains, revealing that even top models achieve only 59.4% accuracy.

arxiv arXiv cs.CL · 1d ago

NatureBench Evaluates AI Coding Agents' Scientific Discovery Capabilities

NatureBench presents a benchmark of 90 tasks from Nature-family papers to assess AI coding agents' ability to achieve scientific discovery. Under a web-search-disabled protocol, the top model exceeds prior state-of-the-art on only 17.8% of tasks. Agents primarily succeed by translating scientific problems into supervised learning tasks, not through original scientific invention.

arxiv arXiv cs.CL · 1d ago

MEMPROBE: Benchmark for Long-Term Memory Recovery in Agents

MEMPROBE is a benchmark that evaluates long-term memory in AI agents by reconstructing a user's hidden state from the agent's memory after interaction. It tests 5 memory systems across 50 simulated users with 31 dimensions each, finding that task completion is high even for memoryless agents, while memory recovery remains moderate and drops under top-k retrieval. MEMPROBE enables direct, auditable assessment of memory retention and proposes recovery as a key objective for future agent development.

arxiv arXiv cs.CL · 1d ago

Qwen-AgentWorld: Language World Models for General Agents

Qwen-AgentWorld-35B-A3B and Qwen-AgentWorld-397B-A17B are the first language world models that simulate agentic environments across seven domains using long chain-of-thought reasoning. Trained via a three-stage pipeline of continued pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning (RL), these models outperform existing frontier models on AgentWorldBench, a benchmark derived from real-world interactions of five models on nine established tasks.

arxiv arXiv cs.LG · 1d ago

Distilling Transformers into Recurrent Transformers for Efficient Memory

A new distillation method transfers the observation compression strategy of full-history transformers to recurrent models. By training a teacher model to compress observation histories into fixed-size bottlenecks, the approach aligns the student's memory with the teacher's compression. This enables recurrent transformers to achieve near-full-history performance with linear-time complexity, making them viable for long-horizon robotics applications.

github CrewAI · 2d ago

CrewAI 1.14.8a3 Release Notes

CrewAI 1.14.8a3 introduces unified declarative flow loading and improved startup UX for crew runs. It consolidates crewai run and flow kickoff commands, adds declarative Flow CLI support, and enables @router() as a flow start method with typed output schemas for tools.

arxiv arXiv cs.AI · 2d ago

FleetAgent: Efficient Teleoperation for Autonomous Fleets

FleetAgent is a cloud-hosted multimodal large language model that processes compact vectorized vehicle-to-network messages to enable efficient, explainable teleoperation. It reduces uplink payload by up to 625 times and KV-cache memory by 625 times compared to raw images or text. On the VecEval dataset, it outperforms Qwen2.5-VL-7B on both Lingo-Judge and intervention failure rate.

arxiv arXiv cs.AI · 2d ago

ARCO: Adaptive Rubric with Co-Evolution for Multi-Step LLM Agents

ARCO introduces a rubric framework that enables step-level credit assignment for multi-step LLM agents. It jointly updates a shared model with generation and scoring heads, allowing the rubric content and scoring function to co-evolve via on-policy data, improving performance and interpretability across benchmarks.

arxiv arXiv cs.AI · 2d ago

Social World Model for Lifelong Social Intelligence

The Social World Model decomposes social interaction into five dimensions to enable closed-loop learning. It allows open-source models to sustainably improve and retain social capabilities, outperforming baselines and matching closed-source Gemini 3 Flash in key metrics without forgetting across difficulty levels.