AI agents — korshunov.ai

Agon: Autonomous Research System via Prompt Economy

Agon is an autonomous research system that uses prompt economy to validate checkable claims in workflows, leaving judgment to human scientists. It operates across 444 iterations with minimal prompts and no human-written code, revealing a taxonomy of failures by severity, fixability, visibility, and capability locus. The system demonstrates scalability and advances research toward a paradigm where machines handle scale and humans guide judgment.

arxiv arXiv cs.CL · 1d ago

Dialogue to Discovery: Attribute-Aware Preference Elicitation

Dialogue to Discovery (D2D) is an attribute-oriented framework that improves conversational product search by dynamically guiding user interactions. It adapts query priorities and recommendation timing, achieving 22.2-29.9% higher target-finding accuracy, 6.6-16.1% lower abandonment, and 27.5% shorter conversations compared to existing methods, with user studies confirming improved satisfaction and efficiency.

arxiv arXiv cs.CL · 1d ago

EDV Framework Enables Reliable Experience Learning for Agentic Systems

The EDV framework introduces an Execute-Distill-Verify paradigm to overcome the self-confirmation trap in large language model agents. By using multiple agents to explore tasks, a third-party agent to distill experiences, and a consensus-based verification step, EDV ensures only accurate experiences are stored in memory. Evaluation on tau2-bench, Mind2Web, and MMTB shows EDV outperforms strong baselines, demonstrating its effectiveness in enabling robust agent self-evolution.
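As a rough illustration of the consensus gate described above, the sketch below stores an experience only when a majority of independent verifiers approve it. This is not the paper's implementation: the names `Experience`, `Memory`, `verify_by_consensus`, and `store_if_verified`, the boolean-vote interface, and the 0.5 threshold are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Experience:
    """A distilled lesson tied to a task (hypothetical structure)."""
    task: str
    lesson: str


@dataclass
class Memory:
    """A plain list-backed memory store (hypothetical structure)."""
    entries: List[Experience] = field(default_factory=list)


def verify_by_consensus(votes: List[bool], threshold: float = 0.5) -> bool:
    """Return True when the share of approving votes exceeds the threshold."""
    if not votes:
        return False  # no verifiers means no consensus
    return sum(votes) / len(votes) > threshold


def store_if_verified(
    memory: Memory,
    experience: Experience,
    verifiers: List[Callable[[Experience], bool]],
    threshold: float = 0.5,
) -> bool:
    """Collect one vote per verifier and store the experience only on consensus."""
    votes = [bool(verifier(experience)) for verifier in verifiers]
    accepted = verify_by_consensus(votes, threshold)
    if accepted:
        memory.entries.append(experience)
    return accepted


if __name__ == "__main__":
    # Toy verifiers standing in for independent verifier agents (assumption).
    verifiers = [
        lambda e: len(e.lesson) > 0,
        lambda e: e.task == "booking",
        lambda e: "confirm" in e.lesson,
    ]
    memory = Memory()
    candidate = Experience("booking", "confirm the date before paying")
    print(store_if_verified(memory, candidate, verifiers), len(memory.entries))
```

In a real system each verifier would be an agent call instead of a lambda, and the threshold would be a tuning choice; the point of the gate is that a single agent's self-assessment never writes to memory on its own.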

arxiv arXiv cs.CL · 1d ago

AGORA: Benchmark for Agentic Workplace Document Reasoning

AGORA introduces a benchmark with 362 questions and 9,664 authentic workplace documents totaling 372M tokens, exceeding any model's context window. It evaluates agents' ability to explore documents deliberately, reconcile inconsistencies, and reason across domains, revealing that even top models achieve only 59.4% accuracy.

arxiv arXiv cs.CL · 1d ago

NatureBench Evaluates AI Coding Agents' Scientific Discovery Capabilities

NatureBench presents a benchmark of 90 tasks from Nature-family papers to assess AI coding agents' ability to achieve scientific discovery. Under a web-search-disabled protocol, the top model exceeds prior state-of-the-art on only 17.8% of tasks. Agents primarily succeed by translating scientific problems into supervised learning tasks, not through original scientific invention.

arxiv arXiv cs.CL · 1d ago

MEMPROBE: Benchmark for Long-Term Memory Recovery in Agents

MEMPROBE is a benchmark that evaluates long-term memory in AI agents by reconstructing a user's hidden state from the agent's memory after interaction. It tests 5 memory systems across 50 simulated users with 31 dimensions each, finding that task completion is high even for memoryless agents, while memory recovery remains moderate and drops under top-k retrieval. MEMPROBE enables direct, auditable assessment of memory retention and proposes recovery as a key objective for future agent development.

arxiv arXiv cs.CL · 1d ago

Qwen-AgentWorld: Language World Models for General Agents

Qwen-AgentWorld-35B-A3B and Qwen-AgentWorld-397B-A17B are the first language world models that simulate agentic environments across seven domains using long chain-of-thought reasoning. Trained via a three-stage pipeline of continued pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning (RL), these models outperform existing frontier models on AgentWorldBench, a benchmark derived from real-world interactions of five models on nine established tasks.

arxiv arXiv cs.LG · 1d ago

Distilling Transformers into Recurrent Transformers for Efficient Memory

A new distillation method transfers the observation compression strategy of full-history transformers to recurrent models. By training a teacher model to compress observation histories into fixed-size bottlenecks, the approach aligns the student's memory with the teacher's compression. This enables recurrent transformers to achieve near-full-history performance with linear-time complexity, making them viable for long-horizon robotics applications.

github CrewAI · 1d ago

CrewAI 1.14.8a3 Release Notes

CrewAI 1.14.8a3 introduces unified declarative flow loading and improved startup UX for crew runs. It consolidates crewai run and flow kickoff commands, adds declarative Flow CLI support, and enables @router() as a flow start method with typed output schemas for tools.

arxiv arXiv cs.AI · 1d ago

FleetAgent: Efficient Teleoperation for Autonomous Fleets

FleetAgent is a cloud-hosted multimodal large language model that processes compact vectorized vehicle-to-network messages to enable efficient, explainable teleoperation. Compared to raw images or text, it reduces uplink payload by up to 625 times and KV-cache memory by 625 times. On the VecEval dataset, it outperforms Qwen2.5-VL-7B on both Lingo-Judge score and intervention failure rate.

arxiv arXiv cs.AI · 1d ago

ARCO: Adaptive Rubric with Co-Evolution for Multi-Step LLM Agents

ARCO introduces a rubric framework that enables step-level credit assignment for multi-step LLM agents. It jointly updates a shared model with generation and scoring heads, allowing the rubric content and scoring function to co-evolve via on-policy data, improving performance and interpretability across benchmarks.

arxiv arXiv cs.AI · 1d ago

Social World Model for Lifelong Social Intelligence

The Social World Model decomposes social interaction into five dimensions to enable closed-loop learning. It allows open-source models to sustainably improve and retain social capabilities, outperforming baselines and matching closed-source Gemini 3 Flash in key metrics without forgetting across difficulty levels.

arxiv arXiv cs.AI · 1d ago

DataClaw0: Agentic Tailoring of Multimodal Data from Raw Streams

DataClaw0 introduces an agentic paradigm for actively refining multimodal data to align with user and downstream intents. It uses a two-stage pipeline with factual anchors to generate a large-scale dataset across five domains and achieves strong alignment via supervised fine-tuning and GRPO. Evaluated on video generation, VQA, and GUI navigation, DataClaw0 produces high-information-density data, enabling efficient model adaptation with minimal training data.

arxiv arXiv cs.AI · 1d ago

LLM-Agent Oversight Must Shift from Calibration to Action-Conditioned Control

Current oversight of LLM agents relies on scalar risk scores, but this fails to capture whether an intervention improves outcomes. The paper introduces "intervention advantage" as the key metric, showing that action-conditioned control outperforms scalar routing across benchmarks, with significant regret reduction in interactive regimes. Calibration alone does not resolve the underlying mismatch in control performance.

arxiv arXiv cs.AI · 1d ago

SwarmX: Agentic Scheduling for Low-Latency Systems

SwarmX introduces neural predictors to enable prompt-aware scheduling in agentic AI systems. It reduces tail latency by up to 61.5% and maintains up to 2x the throughput of production schedulers under the same service level objectives.

arxiv arXiv cs.AI · 1d ago

Unreliable Feedback Can Harm Tool-Using LLM Agents

The study shows that misleading feedback can cause LLM agents to perform worse than with no feedback at all. On HotpotQA, Qwen2.5-7B drops from 44.8 to 4.7 F1 under shuffled retrieval, even though the tools themselves are clean. These results indicate that reported tool gains may be overstated and that no-feedback controls are essential for valid evaluation.

arxiv arXiv cs.AI · 1d ago

AutoRAS: Learning Robust Agentic Systems with Primitive Representations

AutoRAS proposes a framework for automatically designing robust agentic systems by generating sequences of symbolic primitives that encode both structural connectivity and behavioral actions. It optimizes these sequences using safety signals from execution and flow-based objectives, achieving superior performance in both normal and adversarial conditions with minimal degradation under attacks.

arxiv arXiv cs.AI · 1d ago

CORTIS: Text-Only Adaptation of Spoken Language Models

CORTIS enables task-oriented voice agents to generate structured speech outputs by fine-tuning spoken language models using only text-form task supervision. It outperforms ASR-LLM cascades under acoustic degradation, especially in preserving high-level task semantics, without requiring paired speech-target annotations during training.

arxiv arXiv cs.AI · 1d ago

Decoupling Declarative and Procedural Knowledge in Vision-Language-Action Models

w²VLA introduces a modular vision-language-action model that decouples declarative and procedural knowledge. By restructuring information flow, it enables robust behavior cloning and zero-shot skill transfer to novel, dissimilar objects.

arxiv arXiv cs.AI · 1d ago

Design-Time Verification of Agentic AI Workflows

A new approach verifies agentic AI workflows during design by modeling them as compositions of reusable building blocks. It applies twelve structural rules to ensure compatibility, reliably detecting design flaws even after structural transformations like task splitting.