AI agents — korshunov.ai

LegalWorld: Life-Cycle Environment for Legal Agents

LegalWorld models Chinese civil litigation as a causally connected chain of five stages, based on 75,309 judgments. It includes reusable infrastructure to maintain consistency across stages and enables LongJud-Bench to evaluate agent performance across all phases, revealing significant capability gaps between models in different legal tasks.

arXiv cs.CL · 7d ago

HandwritingAgent: Language-Driven Handwriting Synthesis in SVG

HandwritingAgent synthesizes natural handwriting in SVG format without style-specific training. It uses a large reasoning model to generate stroke sequences in a grid canvas, conditioned on text input and a reference style image, enabling efficient, controllable, and generalizable handwriting generation.

arXiv cs.CL · 7d ago

GateMem: Benchmarking Memory Governance in Multi-Principal Shared-Memory Agents

GateMem introduces a benchmark for multi-principal shared-memory agents, evaluating utility, access control, and active forgetting across medical, office, education, and household domains. No method performs strongly on all three governance aspects: long-context prompting gives the best results but at high cost, while retrieval-based and external-memory approaches reduce cost but still leak information.

arXiv cs.CL · 7d ago
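
To make the two governance properties in the GateMem entry concrete, here is a minimal sketch of a shared memory with per-entry access control and explicit forgetting. It is not the benchmark's code or any evaluated method; the `SharedMemory` class, its method names, and the rule that only the owner may forget an entry are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryEntry:
    owner: str
    content: str
    readers: set[str] = field(default_factory=set)  # principals allowed to read


class SharedMemory:
    """Toy shared memory (hypothetical API): access control plus forgetting."""

    def __init__(self) -> None:
        self._entries: dict[str, MemoryEntry] = {}

    def write(self, key, owner, content, readers=()):
        # The owner can always read their own entry.
        self._entries[key] = MemoryEntry(owner, content, set(readers) | {owner})

    def read(self, key, principal):
        entry = self._entries.get(key)
        # Return None for both "missing" and "forbidden" so that a caller
        # cannot learn that a hidden entry exists.
        if entry is None or principal not in entry.readers:
            return None
        return entry.content

    def forget(self, key, principal):
        # Assumption: only the owner may delete an entry.
        entry = self._entries.get(key)
        if entry is None or entry.owner != principal:
            return False
        del self._entries[key]
        return True
```

Returning the same value for a missing and a forbidden entry is one simple way to avoid the existence leaks the summary mentions; a real system would also have to scrub derived summaries and caches when an entry is forgotten.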

Data Recipe Boosts Long-Context Reasoning in LLMs

A data-centric approach improves long-context reasoning in large language models, using eight curated datasets with 14K examples across retrieval, multi-evidence synthesis, and reasoning tasks. When paired with minimal outcome-based GRPO training, it yields average gains of +6.4 to +7.2 points on seven benchmarks, outperforming prior RL training sets, and improves agentic performance by +4.8 points on GAIA and +7.0 points on BrowseComp.

arXiv cs.CL · 7d ago

ScholarSum: Student-Teacher Abstractive Summarization via Knowledge Graph Reasoning

ScholarSum introduces a hierarchical knowledge graph framework that emulates a student-teacher process for scientific summarization. It generates fluent, factually consistent summaries by first structuring documents into semantic units, then refining drafts through evidence retrieval and iterative review by a teacher-like component. Experiments show ScholarSum outperforms existing methods in completeness and factual faithfulness.

arXiv cs.CL · 7d ago

Rubric-Guided Counterfactual Recommendations for Medical Communication

A new pipeline uses language models to recommend minimal, interpretable changes to patient-doctor communication features like tone and personalization. These changes increase predicted positive feedback by an average of 6.41% and are non-negative for 93.31% of cases, without altering medical content.

arXiv cs.CL · 7d ago

SAGE: Stochastic Prompt Optimization via Agent-Guided Exploration

SAGE is a multi-agent framework for prompt optimization that combines diagnostic code execution with quantitative validation. It improves mental-health chatbot retention by aggregating eight cycles of noisy A/B tests into statistically significant gains, demonstrating effectiveness in open-ended dialogue tasks through qualitative and quantitative feedback integration.

arXiv cs.CL · 7d ago
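
The SAGE entry does not say how the eight noisy A/B cycles are aggregated. As one plausible illustration, and not the paper's procedure, the sketch below combines per-cycle z-scores with Stouffer's method. The retention counts are synthetic and made up for the example, and all function names are hypothetical.

```python
import math


def two_proportion_z(ret_a, n_a, ret_b, n_b):
    """z-score for variant B retaining more users than control A (pooled SE)."""
    p_a, p_b = ret_a / n_a, ret_b / n_b
    pooled = (ret_a + ret_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se


def one_sided_p(z):
    """Upper-tail p-value of a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def stouffer(z_scores):
    """Combine independent z-scores with equal weights (Stouffer's method)."""
    return sum(z_scores) / math.sqrt(len(z_scores))


# Synthetic counts, not from the paper:
# (control retained, control users, variant retained, variant users)
cycles = [(100, 500, 112, 500)] * 8

z_scores = [two_proportion_z(*cycle) for cycle in cycles]
per_cycle_p = [one_sided_p(z) for z in z_scores]
combined_p = one_sided_p(stouffer(z_scores))
```

With these numbers no single cycle is significant at the 0.05 level, while the combined test is. The method assumes the cycles are independent and measure the same effect, which would need checking when the prompt changes between cycles.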

SenFlow: Advanced AI-Generated Text Detection in Hybrid Documents

SenFlow introduces a novel method for detecting AI-generated text in hybrid documents by modeling inter-sentence dependencies. It achieves state-of-the-art performance on MOSAIC, a benchmark of 16,000 documents from PubMed and XSum, with a +4.15 pp Macro-F1 gain on cross-domain transfer. SenFlow reveals that AI-generated content still exhibits generator-dependent sentence-length patterns, exploitable by sentence-level detectors despite perplexity filtering.

arXiv cs.CL · 7d ago

Decoupling Search from Reasoning in LLM Agents

Decoupled Search Grounding (DSG) separates search functionality from reasoning models, enabling vendor-agnostic, tunable, and reusable search grounding. DSG achieves near-native accuracy on SimpleQA with 91% lower search cost and 99.4% warm-cache hit rate, while reducing latency by 68% and preserving concise output contracts.

arXiv cs.CL · 7d ago
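
The decoupling described in the DSG entry can be pictured as a search layer that sits behind a small interface and is cached independently of the reasoning model. The sketch below is an assumption-laden illustration, not DSG's implementation: `SearchBackend`, `CachedSearch`, `FakeBackend`, and `answer` are hypothetical names, and the cache key (lowercased, whitespace-collapsed query) is a deliberately naive choice.

```python
from typing import Callable, Protocol


class SearchBackend(Protocol):
    """Any vendor's search can be plugged in if it offers this one method."""

    def search(self, query: str) -> list[str]: ...


class CachedSearch:
    """Wraps a backend with an in-memory cache and tracks the hit rate."""

    def __init__(self, backend: SearchBackend) -> None:
        self.backend = backend
        self.cache: dict[str, list[str]] = {}
        self.hits = 0
        self.misses = 0

    def search(self, query: str) -> list[str]:
        key = " ".join(query.lower().split())  # naive normalization
        if key in self.cache:
            self.hits += 1
        else:
            self.misses += 1
            self.cache[key] = self.backend.search(query)
        return self.cache[key]

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


class FakeBackend:
    """Stand-in for a paid search API; counts how often it is called."""

    def __init__(self) -> None:
        self.calls = 0

    def search(self, query: str) -> list[str]:
        self.calls += 1
        return [f"snippet about {query}"]


def answer(question: str, search: SearchBackend,
           reason: Callable[[str, list[str]], str]) -> str:
    """The reasoner only sees evidence, never the search vendor."""
    return reason(question, search.search(question))
```

Because the reasoner receives evidence as plain strings, swapping the backend or tuning the cache does not touch the model call, which is the property the summary calls vendor-agnostic.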

GraphPO: Graph-based Policy Optimization for Reasoning Models

GraphPO introduces a directed acyclic graph framework to represent reasoning rollouts, merging semantically equivalent paths to reduce redundant exploration. It assigns efficiency and correctness advantages to edges, improving inference efficiency and process supervision while reducing advantage-estimation variance. Experiments show GraphPO outperforms chain- and tree-based methods on three LLMs across reasoning and agentic search tasks under identical token or response budgets.

arXiv cs.CL · 7d ago
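
As a rough picture of the GraphPO idea of merging equivalent reasoning paths into a graph and scoring edges, the sketch below merges steps that match after a simple text normalization and gives each edge a correctness advantage equal to the mean reward of the rollouts passing through it minus the overall mean. Both choices are assumptions made for illustration: the paper's equivalence test and advantage definitions are not given in the summary, and the efficiency advantage is omitted here.

```python
from collections import defaultdict


def canonical(step: str) -> str:
    """Hypothetical equivalence key: case- and whitespace-insensitive text."""
    return " ".join(step.lower().split())


def build_graph(rollouts):
    """Map each edge to the rewards of the rollouts that traverse it.

    rollouts: list of (steps, reward). Nodes are (depth, canonical step),
    so every edge goes from depth d to d + 1 and the graph stays acyclic.
    """
    edges = defaultdict(list)
    for steps, reward in rollouts:
        path = [(0, "<root>")]
        path += [(i + 1, canonical(s)) for i, s in enumerate(steps)]
        for src, dst in zip(path, path[1:]):
            edges[(src, dst)].append(reward)
    return edges


def edge_advantages(edges, rollouts):
    """Mean reward through each edge minus the mean reward of all rollouts."""
    baseline = sum(reward for _, reward in rollouts) / len(rollouts)
    return {e: sum(rs) / len(rs) - baseline for e, rs in edges.items()}


# Toy rollouts: two different first steps reconverge on the same second step.
rollouts = [
    (["x = 2 + 3", "x = 5", "answer 10"], 1.0),
    (["x = 3 + 2", "X = 5", "answer 10"], 1.0),
    (["x = 2 + 3", "x = 6", "answer 12"], 0.0),
]
edges = build_graph(rollouts)
advantages = edge_advantages(edges, rollouts)
```

In a prefix tree the first two rollouts would stay separate after their differing first steps; here they share the node for "x = 5", so the edge into "answer 10" pools two reward samples instead of one.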

Leadership as Coordination Control in Multi-Agent LLM Teams

Process-level coordination control adds value only when the initial majority consensus is unreliable, the task is recoverable, and unguided interaction fails to repair errors. Across multiple models and tasks, no leadership style outperforms the others in accuracy, a result that aligns with contingency theory rather than indicating that the approach fails.

arXiv cs.CL · 7d ago

Human-AI Coevolution Framework Reveals Social Intelligence Emergence

The Human-AI Coevolution Dynamics Framework (HACD-H) introduces a unified model for long-term human-AI interaction, integrating emotional adaptation, memory, and personality into a self-organizing social cognitive system. Results show social intelligence emerges through coevolution, with a significant negative correlation between social intelligence and social cognitive energy (r = -0.391, p < 0.001), and progressive energy reduction over time in interaction trajectories.

arXiv cs.CL · 7d ago

IndicContextEval: Benchmark for Context Utilisation in Audio LLMs

IndicContextEval introduces a 56-hour multilingual benchmark featuring natural speech from 555 speakers across 8 Indian languages and 23 domains. It employs a 7-level prompting framework to progressively test context utilisation, including metadata, descriptions, and adversarial inputs. Evaluation of five models shows significant differences in contextual grounding, underscoring the need for explicit assessment of context use in AudioLLMs.

arXiv cs.AI · 7d ago

R2D-RL: RoboCup 2D Soccer Environment for MARL

R2D-RL bridges RCSS2D and HELIOS-based clients with a Python MARL interface using shared-memory and cycle-level synchronization. It enables full-field and scenario-based training with configurable opponents, action masks, EPV-based reward shaping, and parallel execution, including front-goal scenarios and an 11-vs-11 benchmark with baseline results.

arXiv cs.AI · 7d ago
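
Action masks, mentioned in the R2D-RL entry, are commonly applied by zeroing out the probability of illegal actions before sampling. The sketch below shows that general technique in plain Python; it is not R2D-RL's interface, and the function names and the example mask are hypothetical.

```python
import math
import random


def masked_softmax(logits, mask):
    """Softmax over legal actions only; illegal actions get probability 0."""
    if not any(mask):
        raise ValueError("at least one action must be legal")
    # Subtract the largest legal logit for numerical stability.
    top = max(l for l, ok in zip(logits, mask) if ok)
    exps = [math.exp(l - top) if ok else 0.0 for l, ok in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(logits, mask, rng=random):
    """Sample an action index from the masked distribution."""
    probs = masked_softmax(logits, mask)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

For example, with logits `[2.0, 5.0, 1.0]` and mask `[True, False, True]`, the second action is never sampled even though it has the largest logit, as would be the case for a kick when the ball is out of reach.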

ProfiLLM: Utility-Aligned Agentic User Profiling for Industrial Ride-Hailing Dispatch

ProfiLLM introduces an agentic LLM pipeline that extracts behavioral signals from ride-hailing logs to generate user profiles. It achieves up to +6.14% relative AUC improvement and up to +4.35% GMV gain in dispatching simulations, with consistent online A/B test results showing +0.47% GMV, +0.33% Completion Rate, and -0.82% Cancel-Before-Accept rate improvements.

arXiv cs.AI · 7d ago

Reinforcement Learning Foundation Models Should Already Be A Thing

Reinforcement learning lacks foundation models despite synthetic MDPs being feasible. A proof-of-concept shows a single model trained on synthetic MDPs solves tabular benchmarks without tuning, outperforming existing methods in online settings and matching them offline.

arXiv cs.AI · 7d ago

Space Is Intelligence: Neural Semigroup Superposition for Riemannian Metric Generation

Intelligence is embedded in the space itself, where scenes induce a Riemannian metric on configuration manifolds. A single Encoder-Router network uses semigroup-superposition to generate this metric, enabling zero-shot generalization across unseen obstacle configurations with large cost differences between collision-free and obstacle-penetrating paths.

arXiv cs.AI · 7d ago

Skill-MAS: Evolving Meta-Skill for Automatic Multi-Agent Systems

Skill-MAS introduces a new approach that decouples experience retention from parametric updates by modeling orchestration as an evolvable Meta-Skill. It uses a closed-loop process involving multi-trajectory rollouts and selective reflection to distill reusable strategy principles, achieving strong performance gains and robust transferability across tasks and LLMs.

arXiv cs.AI · 7d ago

WorldLines: Benchmarking Long-Horizon Embodied Agent Memory

WorldLines introduces a project-driven benchmark for long-horizon embodied household assistance, capturing extended household traces with dialogues, actions, and state changes. It enables evidence-linked samples for Memory QA and Embodied Task Planning, and proposes ObsMem, an observer-grounded memory framework that supports visibility-aware memories and state-aware decisions. Experiments highlight challenges in partial observability and memory translation, with ObsMem providing a stronger reference architecture for such settings.