AI agents — korshunov.ai


LLM-as-Interface, ML-as-Predictor for Pediatric Appendicitis

ClaMPAPP, a hybrid system, uses an LLM to extract structured clinical features from free-text notes and passes them to an XGBoost classifier for diagnosis. It outperformed end-to-end LLMs in both internal and external validation, with better diagnostic performance and fewer missed cases, demonstrating superior stability and safety in pediatric appendicitis triage.

arXiv cs.AI · 7d ago

Decision-Focused RL for EV Charging with Unknown Departure Times

A decision-focused RL framework jointly trains a forecaster and a charging controller to handle unknown EV departure times. Compared to standard RL without forecasting, the method improves total reward by up to 14% and reduces unsupplied energy by 55%.

arXiv cs.AI · 7d ago

TxBench-PP: AI Agent Benchmark in Preclinical Pharmacology

TxBench-PP is a verifiable benchmark for small-molecule preclinical pharmacology that tests AI agents' ability to derive accurate conclusions from real-world assay data. Across 16 model configurations, no system reliably passed all evaluations; the best-performing setup (Claude Opus 4.8 / Pi) achieved a 59.3% success rate on 300 endpoint attempts.

arXiv cs.AI · 7d ago

OneCanvas: 3D Scene Understanding via Panoramic Reprojection

OneCanvas enables 3D scene understanding in Vision-Language Models by aggregating patch features onto a panoramic canvas using 3D world coordinates. It achieves state-of-the-art results on SQA3D and VSI-Bench, with strong generalization on SPBench, using significantly less training compute than prior methods.

arXiv cs.AI · 7d ago

X+Slides: Benchmark for Audience-Conditioned Slide Generation

X+Slides introduces a benchmark that evaluates slide generation against the needs of a target audience. It uses 8,133 source-grounded probes across 113 topics and seven scenes to measure Audience Coverage, Domain-wise Coverage, Efficiency, and Correctness. Current systems recover only part of the audience-essential information: DeepPresenter reaches 0.714 Audience Coverage, SlideTailor 0.594, and the NotebookLM ablation 0.853, which highlights the need for source-grounded evaluation.

arXiv cs.AI · 7d ago

Self-Correction Boosts Trust in Social Chatbots

A study finds that social chatbots that correct their own errors earn higher user trust and perceived expertise than those relying on external corrections. A stronger user-chatbot social connection enhances belief change only when the chatbot self-corrects, suggesting that social connection amplifies the effect of self-correction.

arXiv cs.AI · 7d ago

Data Intelligence Agents Enable Autonomous Data Querying

Data Intelligence Agents (DIA) deploy autonomous coding agents to streamline enterprise data workflows. The Query Generator matches or exceeds top published results on seven SQL benchmarks across four dialects, showing generalization through natural-language instructions and an execution-based architecture.

arXiv cs.AI · 7d ago

ScenA: Reference-Driven Multi-Speaker Audio Scene Generation

ScenA conditions a text-to-audio foundation model on multiple reference voices and a natural language scene prompt to generate realistic multi-speaker conversations. It addresses the 'Reference Shortcut' issue by using a high-noise-biased training schedule, ensuring speaker assignment relies on text prompts rather than acoustic similarity. Evaluated on CoVoMix2-Dialogue, ScenA outperforms existing systems in speaker-binding and produces rich, naturalistic audio with overlapping speech and ambient noise.

arXiv cs.CL · 7d ago

Multi-Agent Fictitious Play for Stance-Entangled Decision-Making

A new multi-agent system, Multi-Agent Fictitious Play (MAFP), addresses stance entanglement in decision-making by modeling stakeholder perspectives as agents. MAFP uses game-theoretic fictitious play to iteratively improve decisions through mutual best responses, outperforming baselines on tournament strength and robustness in competitive scenarios.

arXiv cs.CL · 7d ago
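
The MAFP summary above names game-theoretic fictitious play as its core loop. As a rough illustration only, here is classic two-player fictitious play on a small matrix game: each player repeatedly best-responds to the opponent's empirical action frequencies. This is the textbook procedure, not the paper's multi-agent LLM variant; the function names and the matching-pennies payoffs are assumptions chosen for the example.

```python
def best_response(payoffs, opponent_counts):
    """Index of the action with the highest expected payoff against empirical play."""
    total = sum(opponent_counts)
    expected = [
        sum(p * c / total for p, c in zip(row, opponent_counts))
        for row in payoffs
    ]
    return max(range(len(expected)), key=expected.__getitem__)


def fictitious_play(payoff_row, payoff_col, rounds=5000):
    """Run two-player fictitious play and return both empirical mixed strategies.

    payoff_row[i][j]: row player's payoff when row plays i and column plays j.
    payoff_col[j][i]: column player's payoff when column plays j and row plays i.
    """
    # Arbitrary starting beliefs: each player has been seen playing action 0 once.
    counts_row = [1] + [0] * (len(payoff_row) - 1)
    counts_col = [1] + [0] * (len(payoff_col) - 1)
    for _ in range(rounds):
        # Both players best-respond simultaneously to the history so far.
        a = best_response(payoff_row, counts_col)
        b = best_response(payoff_col, counts_row)
        counts_row[a] += 1
        counts_col[b] += 1
    n_row, n_col = sum(counts_row), sum(counts_col)
    return [c / n_row for c in counts_row], [c / n_col for c in counts_col]


# Matching pennies (zero-sum): the row player wants a match, the column player a mismatch.
ROW_PAYOFF = [[1, -1], [-1, 1]]
COL_PAYOFF = [[-1, 1], [1, -1]]
row_strategy, col_strategy = fictitious_play(ROW_PAYOFF, COL_PAYOFF)
print(row_strategy, col_strategy)
```

In zero-sum games such as this one, the empirical frequencies are known to approach an equilibrium mix, so both printed strategies should sit close to an even split.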

Turing-RL: Learning User Simulators with Turing Rewards

Turing-RL introduces a reinforcement learning method using an LLM judge to evaluate how indistinguishable generated responses are from real user inputs. It outperforms baseline methods in both LLM and human evaluations across chat and Reddit forum domains, demonstrating that optimizing for indistinguishability improves user simulator performance.

arXiv cs.CL · 7d ago

OmniAgent: Native Active Perception for Omni-Modal Understanding

OmniAgent introduces a POMDP-based iterative Observation-Thought-Action cycle for video understanding, enabling on-demand action execution to selectively distill audio-visual cues into persistent textual memory. It achieves state-of-the-art performance on ten benchmarks, with a 7B agent outperforming a 10× larger Qwen2.5-VL-72B model on LVBench (50.5% vs. 47.3%).

arXiv cs.LG · 7d ago

Skill-MAS: Evolving Meta-Skill for Automatic Multi-Agent Systems

Skill-MAS introduces a new approach that decouples experience retention from parametric updates by modeling orchestration as an evolvable Meta-Skill. It uses a closed-loop process involving multi-trajectory rollouts and selective reflection to distill reusable strategy principles, achieving strong performance gains and robust transferability across tasks and LLMs.

arXiv cs.LG · 7d ago

GrapNet: A Programmable Dynamic-Architecture Neural Graph Substrate

GrapNet introduces a programmable neural graph substrate where architecture edits are first-class operations. It outperforms dense MLPs on Split Fashion-MNIST and CIFAR-10, achieving 63.16% and 3.81% accuracy gains respectively, with statistically significant results.

arXiv cs.LG · 7d ago

EfficientRollout: System-Aware Self-Speculative Decoding for RL Rollouts

EfficientRollout introduces a self-speculative decoding framework that reduces rollout and end-to-end latency by up to 19.6% and 12.7% respectively, without compromising final model quality. It uses a quantized drafter derived from the target model and integrates a system-aware toggle policy to avoid compute-bound regimes, enabling effective speculation during evolving policy generations.

arXiv cs.LG · 7d ago
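
The draft-and-verify idea behind the EfficientRollout summary above can be sketched in a few lines. This is a generic greedy speculative-decoding loop, not the paper's system: the quantized drafter and the system-aware toggle policy are not modeled, and `target_next` and `draft_next` are hypothetical toy stand-ins for real models.

```python
def target_next(tokens):
    """Toy 'target model': a deterministic next-token rule (hypothetical)."""
    return (tokens[-1] * 3 + len(tokens)) % 11


def draft_next(tokens):
    """Toy 'drafter': agrees with the target unless the last token is a multiple of 4."""
    guess = target_next(tokens)
    return (guess + 1) % 11 if tokens[-1] % 4 == 0 else guess


def greedy_decode(prompt, new_tokens):
    """Baseline: the target model generates one token at a time."""
    tokens = list(prompt)
    for _ in range(new_tokens):
        tokens.append(target_next(tokens))
    return tokens


def speculative_decode(prompt, new_tokens, draft_len=4):
    """Greedy draft-and-verify; returns the tokens and the draft acceptance rate."""
    tokens = list(prompt)
    goal = len(prompt) + new_tokens
    accepted = proposed = 0
    while len(tokens) < goal:
        # 1) The drafter proposes a short continuation autoregressively.
        draft = []
        for _ in range(min(draft_len, goal - len(tokens))):
            draft.append(draft_next(tokens + draft))
        proposed += len(draft)
        # 2) The target checks every drafted position (one batched pass in a real system).
        for i, tok in enumerate(draft):
            expected = target_next(tokens + draft[:i])
            if tok != expected:
                # First mismatch: keep the verified prefix plus the target's own token.
                tokens.extend(draft[:i] + [expected])
                break
            accepted += 1
        else:
            tokens.extend(draft)  # every drafted token was verified
    return tokens, (accepted / proposed if proposed else 0.0)


output, acceptance_rate = speculative_decode([1, 2], new_tokens=20)
assert output == greedy_decode([1, 2], new_tokens=20)
print(output, acceptance_rate)
```

Because every kept token is either verified against the target or produced by it, the output matches plain greedy decoding; any speedup comes only from how many drafted tokens are accepted per verification pass.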

Spotlight: Using Spot GPUs to Accelerate DiT RL Post-Training

Spotlight enables DiT RL post-training by leveraging idle spot GPUs, reducing costs by 1.4x to 6.4x while achieving superior image quality. It uses stale model weights in exploration and reconfigures sequence parallelism on the fly, allowing efficient GPU utilization without breaking training pipelines.

arXiv cs.LG · 7d ago

ViGOS: Decoupling Perception and Reasoning in Multimodal On-Policy Self-Distillation

ViGOS introduces a visually grounded on-policy self-distillation framework for multimodal large language models. It decouples perception and reasoning by using an image-only teacher for visual descriptions and a reasoning teacher for final outputs, reducing reliance on text-only references. This approach improves image-grounded performance across multiple vision-language benchmarks.

arXiv cs.LG · 7d ago

OpenAnt: LLM-Powered Vulnerability Discovery System

OpenAnt uses code decomposition, adversarial verification, and dynamic testing to identify vulnerabilities in large codebases. It reduces analysis surface by up to 97% and cuts false positives while validating findings through automated, sandboxed execution. Evaluated on OpenSSL, WordPress, and Flowise, it discovers previously unknown vulnerabilities with manageable cost and scalability.

arXiv cs.CL · 7d ago

PhysAssistBench Evaluates LLMs in Doctor-Patient-EHR Interaction

PhysAssistBench introduces a benchmark for interactive doctor-patient-EHR assistance using real MIMIC-IV cases. It features 1,296 manually reviewed, physician-validated turns and reveals that current LLMs struggle with coordinating clinical knowledge, communication, and EHR system interaction.

arXiv cs.CL · 7d ago

PEC-Home: Simulated Dataset for Elliptical Command Interpretation

PEC-Home is the first simulated dataset designed to enable smart home assistants to interpret progressively elliptical commands. Experiments show that even with dialogue history tools, LLMs like GPT-4o fail to achieve accurate command execution from elliptical inputs, highlighting a significant gap in current assistant capabilities.

arXiv cs.CL · 7d ago

EARS Framework Enhances Multi-Agent System Reliability

EARS introduces explanatory abstention in sub-agents to improve reliability in large-scale multi-agent systems. By providing actionable failure rationales to coordinators, EARS increases the overall response pass rate from 68.5% to 78.9% in a production e-commerce assistant.
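
One way to picture explanatory abstention is a sub-agent result type that carries either an answer or a reason for declining, which the coordinator can act on. The sketch below is an assumption-laden illustration, not the EARS interface: `SubAgentResult`, `order_agent`, `catalog_agent`, and `coordinate` are all hypothetical names, and the routing rule is deliberately simplistic.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class SubAgentResult:
    answer: Optional[str] = None
    abstain_reason: Optional[str] = None  # filled in when the sub-agent declines

    @property
    def abstained(self) -> bool:
        return self.answer is None


def order_agent(query: str) -> SubAgentResult:
    """Hypothetical sub-agent that only handles order-related queries."""
    if "order" not in query.lower():
        return SubAgentResult(
            abstain_reason="no order reference in query; try the catalog agent"
        )
    return SubAgentResult(answer="Order status: shipped")


def catalog_agent(query: str) -> SubAgentResult:
    """Hypothetical sub-agent that answers product questions."""
    return SubAgentResult(answer="Found 3 matching products")


Agent = Callable[[str], SubAgentResult]


def coordinate(
    query: str, agents: List[Tuple[str, Agent]]
) -> Tuple[Optional[str], List[str]]:
    """Try sub-agents in order, collecting rationales from those that abstain."""
    rationales: List[str] = []
    for name, agent in agents:
        result = agent(query)
        if not result.abstained:
            return result.answer, rationales
        rationales.append(f"{name}: {result.abstain_reason}")
    return None, rationales  # nobody answered; the rationales explain why


answer, rationales = coordinate(
    "Do you sell red shoes?",
    [("order", order_agent), ("catalog", catalog_agent)],
)
print(answer, rationales)
```

The point of the design is that an abstention is not a silent failure: the coordinator receives a rationale it can log, surface, or use to pick the next sub-agent.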