AI agents — korshunov.ai

Over-Privileged Tool Selection in LLM Agents

LLM agents commonly select higher-privilege tools despite sufficient lower-privilege alternatives. This over-privileged behavior is amplified by transient tool failures and does not reliably improve with general safety alignment. A new privilege-aware post-training defense reduces unnecessary high-privilege tool use while maintaining agent capabilities.

arxiv arXiv cs.CL · 7d ago

Generative Engine Optimization: Measuring AI Search Visibility

A large-scale study of 100K+ AI prompt responses across 100+ brands reveals a three-tier brand visibility ladder: global brands appear in 73% of answers, mid-market brands in 44%, and niche brands in just 11%. AI engines primarily cite corporate websites, with YouTube leading non-corporate sources and best-of listicles accounting for 21% of citations. Sentiment in brand mentions is unstable: it flips six times more often than whether the brand is mentioned at all.

arxiv arXiv cs.CL · 7d ago

Adaptive LLM Tutoring Improves Engagement and Efficiency

A new adaptive LLM tutoring system uses subject-aware prompting to enhance student engagement. It outperforms static models in simulation and in real-world A/B testing, reducing interactions by 3 turns and increasing exercise conversion rates; the stochastic router variant performs best, reaching 28.1%.

arxiv arXiv cs.CL · 7d ago

MedRLM: Recursive Multimodal Health Intelligence Framework

MedRLM enables long-context clinical reasoning by recursively inspecting patient data across text, images, sensors, and guidelines. It integrates specialized agents and a Clinical Evidence Graph Memory to connect patient observations with evidence, biomarkers, and referral criteria, supporting sensor-triggered reasoning and uncertainty-gated clinician review.

media r/LocalLLaMA · 7d ago

GLM-5.2 Outperforms GPT-5.5 in AA-Briefcase Evaluation

Artificial Analysis' new agentic knowledge work evaluation, AA-Briefcase, shows GLM-5.2 surpassing GPT-5.5 in performance. The benchmark assesses real-world task execution and reasoning capabilities in knowledge work scenarios.

github LangGraph · 7d ago

langgraph releases version 1.2.6

LangGraph releases version 1.2.6, fixing a regression where nested subgraphs incorrectly inherit parent checkpoint_ns. The update also improves cancellation of running subgraphs during stream aborts and includes a CLI version update to 0.4.30.

media r/LocalLLaMA · 7d ago

Local Qwen isn't a worse Opus, it's a different tool

The article argues that Local Qwen is not inferior to Opus, but rather serves a different purpose. It emphasizes that each model is designed for specific use cases, and comparing them directly overlooks their distinct capabilities and intended applications.

media r/LocalLLaMA · 7d ago

Calibrating 2-bit GGUFs for agentic coding tasks

2-bit quantized versions of Qwopus3.6-27B-Coder, calibrated on real agentic coding logs, achieve a 63% pass rate on SWE-rebench. The IQ2_M quant outperforms non-calibrated versions and rivals Q5_K_M in pass rate despite being half the size, with improved robustness to loops and faster decoding due to a bundled MTP.

media r/LocalLLaMA · 7d ago

Laguna M.1: 225B Parameter MoE Model for Agentic Coding

Laguna M.1 is a 225B-parameter mixture-of-experts model with 23B activated parameters per token, designed for agentic coding and long-horizon tasks. It achieves competitive performance on SWE-bench Verified (74.6%), SWE-bench Multilingual (63.1%), and Terminal-Bench 2.0 (45.8%), outperforming models like Devstral 2 and GLM-4.7 on key benchmarks.

media r/LocalLLaMA · 7d ago

My suitcase robot gets high from real gas sensor

A real MQ-2 gas sensor detects smoke and feeds live data to an LLM sampler, adjusting temperature, top_p, and top_k in real time. As smoke increases, the robot's speech becomes loopier and more associative, with no scripted 'stoned' mode, demonstrating live model behavior driven by physical input.
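
The sensor-to-sampler idea above can be sketched roughly as follows. This is a minimal illustration, not the poster's code: the function `params_from_smoke`, the calibration values `clean` and `saturated`, and the parameter ranges are all hypothetical assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class SamplerParams:
    temperature: float
    top_p: float
    top_k: int


def lerp(low: float, high: float, t: float) -> float:
    """Linear interpolation between low and high for t in [0, 1]."""
    return low + (high - low) * t


def params_from_smoke(reading: float,
                      clean: float = 200.0,
                      saturated: float = 800.0) -> SamplerParams:
    """Map a raw sensor reading to sampling parameters.

    `clean` and `saturated` are assumed calibration points (hypothetical
    ADC values), not figures from the original post.
    """
    # Normalize the reading to [0, 1] and clamp out-of-range values.
    t = (reading - clean) / (saturated - clean)
    t = max(0.0, min(1.0, t))
    # More smoke -> higher temperature and a wider candidate pool.
    return SamplerParams(
        temperature=lerp(0.7, 1.6, t),
        top_p=lerp(0.9, 1.0, t),
        top_k=round(lerp(40, 200, t)),
    )


if __name__ == "__main__":
    for reading in (150, 500, 900):
        print(reading, params_from_smoke(reading))
```

Each new reading would be converted this way and passed to the next generation call, so looser sampling emerges from the input rather than from a scripted mode.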

media r/LocalLLaMA · 7d ago

mistral.rs v0.8.10 adds /v1/skills support for local models

mistral.rs v0.8.10 introduces OpenAI-compatible Agent Skills via a /v1/skills endpoint, enabling local models to execute domain-specific instructions and scripts without relying on frontier APIs. The update supports tools like file uploads and downloads via /v1/files and includes prebuilt binaries for Linux, macOS, and Windows.

media r/LocalLLaMA · 7d ago

SLMs and Diffusion: The Future of Small, Specialized Models?

Users discuss whether task-specific small language models (SLMs) can outperform larger models in specific tasks, citing benchmarks where 9B models match or exceed larger ones. They propose a sequential agentic workflow using multiple specialized models, with one coordinating and others verifying answers, suggesting diffusion models could accelerate such workflows despite reduced intelligence.

media r/LocalLLaMA · 7d ago

The power of intelligence is better in the hands of the people than in the board rooms of tycoons

The PearlOS project has launched an open-source swarm intelligence platform that uses local models to handle multimodal tasks. It automatically selects and switches between top-performing models based on benchmarks, ensuring users always access the latest and most capable models without relying on closed-source systems or subscriptions.

media r/LocalLLaMA · 7d ago

Local LLM Agent Now Generates Images and Video Offline

A user equipped their local LLM agent with MCP tools so it can generate images and video directly. The system operates fully offline and is free to use, with details and source code available in the comments.

media r/LocalLLaMA · 7d ago

Keye-VL-2.0-30B-A3B Launches with Advanced Video Understanding and Agent Capabilities

Keye-VL-2.0-30B-A3B is a 30B-parameter multimodal model designed for long-video understanding and agent functionality. It outperforms open-source rivals and matches Gemini-3-Flash in temporal grounding, supports up to 256K context with near-lossless reasoning, and includes built-in capabilities for code, tool, and web search agent workflows.

github AutoGPT · 7d ago

autogpt-platform-beta-v0.6.64 Released

The autogpt-platform-beta-v0.6.64 release, dated 18th June 2026, introduces new features such as the AutoPilot Context Panel and Global Search, along with enhancements to graph saving, caching, and builder performance. It also includes security hardening, bug fixes for LLM provider issues, and UI improvements like a high-resolution touch icon.

github CrewAI · 7d ago

CrewAI v1.14.8a Releases New FlowDefinition Features

CrewAI v1.14.8a introduces script and crew actions to FlowDefinition, adds DMN mode support, and enables flow execution without Python code. It also includes experimental support for JSON-first crews and ZIP deployment fallback, along with improved memory reset and token usage tracking.

arxiv arXiv cs.LG · 7d ago

TxBench-PP: AI Agent Performance in Preclinical Pharmacology

TxBench-PP is a verifiable benchmark for small-molecule preclinical pharmacology, testing AI agents' ability to derive accurate conclusions from real-world assay data. Across 16 model-harness configurations, no system reliably made correct preclinical pharmacology decisions; the two best configurations reached 59.3% (Claude Opus 4.8 / Pi) and 55.3% (GPT-5.5 / Pi) of endpoint attempts.

arxiv arXiv cs.LG · 7d ago

Act2Answer Evaluates Knowledge Retention in Vision-Language-Action Models

Act2Answer introduces a lightweight protocol to assess commonsense and world knowledge retention in VLA models by requiring agents to answer questions through object placement actions. A large-scale study of 7 VLA models and 9 VLM baselines reveals that VLAs perform well on simple concepts but show larger gaps on rich semantic categories compared to their source VLMs, with VQA co-training improving knowledge retention and peak answer-relevant signals observed in middle VLA layers.

arxiv arXiv cs.LG · 7d ago

Reverse-Engineering Transformer Attention with Executable Programs

A new method uses program synthesis to generate Python programs that reproduce attention patterns in transformer models. These programs achieve over 75% average Intersection-over-Union similarity on held-out data and can replace up to 25% of attention heads with minimal impact on model performance, increasing perplexity by only 16% on average.
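
For readers unfamiliar with the metric, one plausible way to compute Intersection-over-Union between two attention maps is sketched below. The summary does not say how the paper binarizes attention, so the `threshold` parameter and the function name `attention_iou` are assumptions made for illustration.

```python
def attention_iou(predicted, reference, threshold=0.1):
    """IoU between two attention maps given as lists of rows.

    A cell counts as "attended" when its weight is >= threshold.
    The thresholding rule is an assumption, not the paper's definition.
    """
    intersection = 0
    union = 0
    for p_row, r_row in zip(predicted, reference):
        for p, r in zip(p_row, r_row):
            p_on = p >= threshold
            r_on = r >= threshold
            intersection += int(p_on and r_on)
            union += int(p_on or r_on)
    # Two empty maps are treated as identical.
    return intersection / union if union else 1.0


if __name__ == "__main__":
    # Toy 3x3 causal attention maps: program output vs. real head.
    program = [[1.0, 0.0, 0.0],
               [0.5, 0.5, 0.0],
               [0.0, 0.5, 0.5]]
    head = [[1.0, 0.0, 0.0],
            [0.5, 0.5, 0.0],
            [0.3, 0.3, 0.4]]
    print(attention_iou(program, head))
```

In this toy case the two maps share five attended cells out of six in their union, so the score is 5/6.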