Qwen Releases Qwen-AgentWorld-397B-A17B Model
Qwen has announced a new large language model called Qwen-AgentWorld-397B-A17B. The model appears on both Hugging Face and Qwen's official blog, indicating that it has been publicly released and is available for use.
Qwen-AgentWorld is a GitHub repository introducing language world models designed for general-purpose agents. The project aims to give agents a broader, more realistic understanding of the world through language-based modeling.
Qwen has launched Qwen-AgentWorld-35B-A3B, a 35B-parameter MoE model with only about 3B active parameters per token. It is trained to simulate responses from MCP, terminal, software engineering, Android, web, and OS GUI environments by predicting next observations after agent actions, enabling efficient agent training and environment simulation without real tool execution.
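To make the idea of simulating an environment without real tool execution concrete, here is a minimal Python sketch. It is not Qwen's API: the names SimulatedEnv, Predictor, and stub_predictor are hypothetical, and the stub stands in for a call to a world model that would predict the next observation from the interaction history and the new action.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical signature: (history of (action, observation) pairs, new action)
# -> predicted next observation. A real setup would call a world model here.
Predictor = Callable[[List[Tuple[str, str]], str], str]


@dataclass
class SimulatedEnv:
    """Environment whose observations are predicted rather than executed."""
    predict: Predictor
    history: List[Tuple[str, str]] = field(default_factory=list)

    def step(self, action: str) -> str:
        # No tool is run: the predictor returns what the environment
        # would plausibly have shown after this action.
        observation = self.predict(self.history, action)
        self.history.append((action, observation))
        return observation


def stub_predictor(history: List[Tuple[str, str]], action: str) -> str:
    # Assumption: a trivial rule-based stand-in for a model call.
    if action.startswith("ls"):
        return "README.md\nsrc"
    return f"(simulated) no output for: {action}"


env = SimulatedEnv(predict=stub_predictor)
print(env.step("ls -la"))
print(env.step("pwd"))
```

An agent training loop could call step() exactly as it would on a real terminal or GUI environment, which is what would make such a simulator a drop-in replacement under these assumptions.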
A new study decomposes agent memory into four core modules and evaluates 12 systems across five benchmark workloads. It finds no single architecture dominates, with performance dependent on alignment with workload bottlenecks, and reveals that localized maintenance is more cost-efficient than global reorganization.
The bottleneck for autonomous agents in e-commerce is a scarcity of trustworthy product information, not product matching. A proposed micro-transaction model lets agents pay fractions of a cent to access verified data such as service histories and test reports, with pricing and trust scored via reputation. The system prioritizes genuine product quality and real-time information acquisition over chatbot fluency.
SHERLOC introduces a training-free framework that pairs a reasoning LLM with compact repository tools and self-recovery. It achieves state-of-the-art localization accuracy and recall on SWE-Bench, improving repair agents' resolve rate by 5.95 percentage points while reducing localization and total token usage by 36.7% and 23.1% respectively.
Metis introduces a hierarchical dual-representation memory that combines text and code memory to improve self-evolving agents. It organizes experience into execution plans, facts, and pitfalls, crystallizing reusable plans into validated tools only when justified. Evaluated on AppWorld, Metis achieves up to 20.6% higher task accuracy and 22.8% lower execution cost than ReAct, with better overall balance across accuracy, efficiency, and memory cost.
MedBench v5 introduces a dynamic, process-oriented benchmark for clinical multimodal models, featuring clinical cognitive responsiveness and atomic skills across 63 tasks. It includes stressors for degradation analysis and monitors hallucination propagation through five reasoning nodes, revealing that strong task performance does not ensure process stability.
Agon is an autonomous research system that uses prompt economy to validate checkable claims in workflows, leaving judgment to human scientists. It operates across 444 iterations with minimal prompts and no human-written code, revealing a taxonomy of failures by severity, fixability, visibility, and capability locus. The system demonstrates scalability and advances research toward a paradigm where machines handle scale and humans guide judgment.
Dialogue to Discovery (D2D) is an attribute-oriented framework that improves conversational product search by dynamically guiding user interactions. It adapts query priorities and recommendation timing, achieving 22.2-29.9% higher target-finding accuracy, 6.6-16.1% lower abandonment, and 27.5% shorter conversations compared to existing methods, with user studies confirming improved satisfaction and efficiency.
The EDV framework introduces an Execute-Distill-Verify paradigm to overcome the self-confirmation trap in large language model agents. By using multiple agents to explore tasks, a third-party agent to distill experiences, and a consensus-based verification step, EDV ensures only accurate experiences are stored in memory. Evaluation on tau2-bench, Mind2Web, and MMTB shows EDV outperforms strong baselines, demonstrating its effectiveness in enabling robust agent self-evolution.
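The Execute-Distill-Verify flow described above can be sketched as follows. This is an illustrative reading of the summary, not the authors' implementation: edv_step, verify_by_consensus, and the majority threshold min_agree are hypothetical, and the explorer, distiller, and verifier callables stand in for LLM agents.

```python
from typing import Callable, List


def verify_by_consensus(
    experience: str,
    verifiers: List[Callable[[str], bool]],
    min_agree: float = 0.5,
) -> bool:
    """Accept an experience only if more than min_agree of verifiers approve."""
    if not verifiers:
        raise ValueError("at least one verifier is required")
    votes = [verifier(experience) for verifier in verifiers]
    return sum(votes) / len(votes) > min_agree


def edv_step(
    task: str,
    explorers: List[Callable[[str], str]],
    distill: Callable[[List[str]], str],
    verifiers: List[Callable[[str], bool]],
    memory: List[str],
) -> bool:
    # Execute: several agents attempt the task independently.
    trajectories = [explore(task) for explore in explorers]
    # Distill: a separate agent summarizes the attempts into one experience.
    experience = distill(trajectories)
    # Verify: store the experience only if the verifiers reach consensus.
    if verify_by_consensus(experience, verifiers):
        memory.append(experience)
        return True
    return False
```

Keeping the distiller and verifiers separate from the explorers is the point of the design in this sketch: an agent never gets to approve its own conclusions, which is how the summary frames the self-confirmation trap.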
Agora introduces a benchmark with 362 questions and 9,664 authentic workplace documents totaling 372M tokens, exceeding any model's context window. It evaluates agents' ability to explore documents deliberately, reconcile inconsistencies, and reason across domains, revealing that even top models achieve only 59.4% accuracy.
NatureBench presents a benchmark of 90 tasks from Nature-family papers to assess AI coding agents' ability to achieve scientific discovery. Under a web-search-disabled protocol, the top model exceeds prior state-of-the-art on only 17.8% of tasks. Agents primarily succeed by translating scientific problems into supervised learning tasks, not through original scientific invention.
MEMPROBE is a benchmark that evaluates long-term memory in AI agents by reconstructing a user's hidden state from the agent's memory after interaction. It tests 5 memory systems across 50 simulated users with 31 dimensions each, finding that task completion is high even for memoryless agents, while memory recovery remains moderate and drops under top-k retrieval. MEMPROBE enables direct, auditable assessment of memory retention and proposes recovery as a key objective for future agent development.
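One way to picture recovery as a metric, and why it can drop under top-k retrieval, is the toy Python sketch below. The data layout, the exact-match scoring, and the names reconstruct and recovery_rate are assumptions made for illustration; the benchmark's actual reconstruction and scoring procedure is not described in this summary.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical memory entry: (dimension, remembered value, retrieval score).
MemoryEntry = Tuple[str, str, float]


def reconstruct(memory: List[MemoryEntry], k: Optional[int] = None) -> Dict[str, str]:
    """Rebuild a user state from memory, optionally from the top-k entries only."""
    ranked = sorted(memory, key=lambda entry: entry[2], reverse=True)
    if k is not None:
        ranked = ranked[:k]
    state: Dict[str, str] = {}
    for dimension, value, _score in ranked:
        # Keep the highest-scoring value seen for each dimension.
        state.setdefault(dimension, value)
    return state


def recovery_rate(hidden: Dict[str, str], reconstructed: Dict[str, str]) -> float:
    """Fraction of hidden-state dimensions recovered exactly."""
    if not hidden:
        raise ValueError("hidden state must not be empty")
    correct = sum(1 for dim, value in hidden.items() if reconstructed.get(dim) == value)
    return correct / len(hidden)


hidden = {"diet": "vegan", "city": "Oslo", "budget": "low", "pet": "cat"}
memory = [("diet", "vegan", 0.9), ("city", "Oslo", 0.8), ("budget", "low", 0.4)]

print(recovery_rate(hidden, reconstruct(memory)))       # 0.75
print(recovery_rate(hidden, reconstruct(memory, k=2)))  # 0.5
```

In this toy case the memory never stored the "pet" dimension, so full-memory recovery is capped at 0.75, and restricting retrieval to the two highest-scoring entries loses "budget" as well.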
Qwen-AgentWorld-35B-A3B and Qwen-AgentWorld-397B-A17B are the first language world models that simulate agentic environments across seven domains using long chain-of-thought reasoning. Trained via a three-stage pipeline of continued pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning (RL), these models outperform existing frontier models on AgentWorldBench, a benchmark derived from real-world interactions of five models on nine established tasks.
A new distillation method transfers the observation compression strategy of full-history transformers to recurrent models. By training a teacher model to compress observation histories into fixed-size bottlenecks, the approach aligns the student's memory with the teacher's compression. This enables recurrent transformers to achieve near-full-history performance with linear-time complexity, making them viable for long-horizon robotics applications.
CrewAI 1.14.8a3 introduces unified declarative flow loading and improved startup UX for crew runs. It consolidates crewai run and flow kickoff commands, adds declarative Flow CLI support, and enables @router() as a flow start method with typed output schemas for tools.
FleetAgent is a cloud-hosted multimodal large language model that processes compact vectorized vehicle-to-network messages to enable efficient, explainable teleoperation. Compared to raw images or text, it reduces uplink payload by up to 625 times and KV-cache memory by 625 times. On the VecEval dataset, it outperforms Qwen2.5-VL-7B on both Lingo-Judge and intervention failure rate.
ARCO introduces a rubric framework that enables step-level credit assignment for multi-step LLM agents. It jointly updates a shared model with generation and scoring heads, allowing the rubric content and scoring function to co-evolve via on-policy data, improving performance and interpretability across benchmarks.
The Social World Model decomposes social interaction into five dimensions to enable closed-loop learning. It allows open-source models to sustainably improve and retain social capabilities, outperforming baselines and matching closed-source Gemini 3 Flash in key metrics without forgetting across difficulty levels.