AI agents — korshunov.ai

AI agents Page 15 / 20

We built an open source UI kit for document RAG/agents

Extend AI has released an open source UI kit with 15 components for PDF, DOCX, and XLSX viewers, including bounding box citations, file upload, e-signature, and file systems. The toolkit, MIT licensed and fully customizable, was initially internal but is now open source due to customer demand, and is maintained for scalability and edge case handling in high-volume document processing.

media r/LocalLLaMA · 8d ago

GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine?

GameCraft-Bench evaluates whether large language models can build playable games end-to-end using a real game engine. The benchmark includes assessments of major models like Opus-4.7 and GPT-5.5, with interest in how medium-sized models (e.g., 30-70B parameters) perform on game development tasks.

media r/LocalLLaMA · 8d ago

Headless screenshot loops enable a 30B local agent to debug raytraced FPS in pure C

A local 30B agent, using headless screenshot loops, autonomously debugs a raytraced FPS demo in pure C by capturing frames at key events and iterating on fixes. The agent builds a recursive visual debugging loop, demonstrating that simple feedback mechanisms can enable small models to solve complex, visually grounded tasks.

media r/LocalLLaMA · 8d ago

SIQ-1 Qwen3.6 Achieves Strong Performance in Autoresearch and Benchmarking

The SIQ-1 model, trained using PPO with verifiable reward, outperforms GLM-5.2 and Qwen-350B on parameter-golf tasks, with outputs resembling Opus4.8. It also beats NEX and GPT-5.5 on the bullshit-bench test. The model and GGUF version are available on Hugging Face, along with a ZeroGPU-compatible agent demo.

media r/LocalLLaMA · 8d ago

Local LLM-powered RPG with persistent generated content

The developer released a local LLM-powered RPG where NPCs, locations, items, and quests are generated as persistent in-game objects. These elements can be revisited and interacted with, and the game integrates LLMs into core RPG mechanics like dialogue, narration, and quest progression, while managing inventory, combat, and saves. The game sold about 1,800 copies in its first week and has a 4.0 store rating, indicating player interest in AI-driven RPG experiences.

media r/LocalLLaMA · 8d ago

Local models went from mostly useless to actually useful in one year

Local models transitioned from being primarily privacy-focused toys to practical tools for coding, private document management, and local workflows within a year. While they still fall short of replacing top closed models for complex tasks requiring planning and error correction, the overall improvement in usability and performance is evident.

arxiv arXiv cs.LG · 8d ago

LegalHalluLens: Auditing Hallucinations in Legal AI

LegalHalluLens introduces a framework to audit AI hallucinations in legal contexts by analyzing typed hallucination profiles across four claim categories. It reveals a 38-40 point gap between obligation/numeric and temporal claims, and shows two systems with identical 52% hallucination rates can have opposite risk directions. The framework uses a Risk Direction Index and calibrated debate pipelines to reduce fabricated detections by 45%, offering actionable diagnostics for trustworthy legal AI deployment.

arxiv arXiv cs.LG · 8d ago

Uncertainty Quantification for Flow-Based Vision-Language-Action Models

We propose a method using velocity-field disagreement to quantify epistemic uncertainty in flow-matching vision-language-action models. This uncertainty estimate enables failure detection during deployment and active fine-tuning via the SAVE framework, which reduces expert demonstrations by at least 22% compared to baselines, with better-calibrated predictions on the LIBERO benchmark.

arxiv arXiv cs.LG · 8d ago

Compositional Generalization in Language Model Reasoning

A hierarchical latent selection model shows that supervised fine-tuning and reinforcement learning work together to enable compositional generalization in language models. SFT provides raw module materials, while RL identifies and recombines atomic modules from compound traces to solve new problems. Training on compound traces leads to stronger generalization than isolated module training, and an effective protocol is found where SFT ensures module coverage and RL drives exploration of novel compositions.

arxiv arXiv cs.LG · 8d ago

OmniPlan: Adaptive Framework for Timely and Near-Optimal Network Planning

OmniPlan introduces an adaptive framework that converts natural-language user intents into quantifiable preferences using a large language model. It dynamically selects among mixed integer programming, heuristics, and deep reinforcement learning experts to achieve both timeliness and near-optimality in network planning. Evaluations on distributed machine learning workloads show up to 97.8% latency reduction and 11.5% lower resource consumption.

arxiv arXiv cs.LG · 8d ago

Handlebars Triple-Brace Injection Exploits Structural Role Delimiters

Handlebars' triple-brace interpolation fails to protect against structural role injection, as HTML escaping only neutralizes angle-bracket delimiters. It leaves colon and Markdown hash delimiters intact, enabling attackers to hijack model behavior. The default escaping provides no protection for most role delimiter schemes and cannot replace a clear separation of instructions and data.

arxiv arXiv cs.LG · 8d ago

Embedded ML Workflow for Microcontroller Edge Devices

This paper outlines a systems-oriented workflow for embedded machine learning on microcontroller-class devices. It details key engineering decisions such as data sampling, feature extraction, class imbalance validation, model-runtime co-design, and streaming deployment, using inertial motion recognition and keyword spotting as case studies. The work provides practical design rules for robust on-device inference, including data curation, quantization, thresholding, scheduling, and field monitoring.

arxiv arXiv cs.LG · 8d ago

Flash Endurance as Depreciating Capital in Robot Memory

A robot's flash memory degrades with each write, forming a non-renewable asset. A wear-aware pricing model uses a shadow price $η$ to guide memory placement across RAM, NVM, and cloud, with optimal routing depending on whether task value increases with memory persistence. The sign of the value-write association $χ$ varies by deployment: positive in long-horizon manipulation, null in short-horizon tasks, and negative in teleoperation. The endurance budget is binding only on low-end QLC/eMMC memory, and while wear-aware routing aligns with task value, actual performance improvements remain unverified in data.

arxiv arXiv cs.LG · 8d ago

ATT&CK-Labeled Multi-Source Cybersecurity Logs Dataset Released

A new dataset combines system, network, and browser logs from 870 Windows sessions, including 70 attacks and 800 benign cases. It provides per-event labels with MITRE ATT&CK technique IDs for 12 tactics and 53 techniques, using real attack tools like RAT and C2 tunnels. Fine-tuning three Small Language Models (SLMs) via LoRA improved chunk classification accuracy to 90–97% and achieved up to 42% exact-match accuracy in technique identification, showing strong reasoning capture despite challenges.

arxiv arXiv cs.LG · 8d ago

Learning Red Agent Policy from Observations for Neurosymbolic Cyber Agents

A policy learning technique using imitation learning is proposed to predict red agent actions in partially observable cyber environments. The method learns red agent policies from network observations and defender actions, enabling neurosymbolic cyber-defense agents to accurately predict attacks and adapt defenses in diverse simulated scenarios.

arxiv arXiv cs.LG · 8d ago

AdaVoMP: Adaptive Volumetric Mechanical Property Fields

AdaVoMP predicts accurate spatially-varying Young's modulus, Poisson's ratio, and density for 3D objects across resolutions. It uses a sparse, adaptive voxel structure and a sparse transformer encoder-decoder to achieve 16^3 times higher resolution than prior methods, with improved accuracy and lower test-time compute.

arxiv arXiv cs.LG · 8d ago

ReproRepo: Scalable Reproducibility Audits with GitHub Issues

ReproRepo introduces a scalable framework using GitHub issues to evaluate ML paper reproducibility. It shows that LLM agents like Codex with GPT-5.5 identify at least one human-reported blocker in 90% of 1,149 ML papers, highlighting their ability to detect visible failures and semantic issues, though exact localization remains limited.

arxiv arXiv cs.CL · 8d ago

LegalHalluLens: Auditing Hallucinations in Legal AI

LegalHalluLens introduces a framework to audit AI hallucinations in legal contexts by analyzing typed hallucination profiles across four claim categories. It reveals a 38-40 point gap between obligation/numeric and temporal claims, and shows two systems with identical 52% hallucination rates can have opposite risk directions. The framework uses a Risk Direction Index and calibrated debate pipelines to reduce fabricated detections by 45% and improve accountability in legal AI deployment.

arxiv arXiv cs.CL · 8d ago

ProvenanceGuard: Source-Aware Factuality Verification for MCP-Based LLM Agents

ProvenanceGuard introduces a source-aware verifier for MCP-based LLM agents that detects cross-source conflation by routing claims to specific evidence sources and comparing stated attribution with actual source ownership. It achieves block F1 of 0.802 and source accuracy of 0.858 on 260 source-eligible claims, outperforming source-blind baselines, and detects all injected attribution swaps in 50 clinical probes.

arxiv arXiv cs.CL · 8d ago

SkillWeaver: Compositional Skill Routing for LLM Agents

SkillWeaver introduces a decompose-retrieve-compose framework for LLM agents, formalizing the Compositional Skill Routing problem. It achieves 67.7% decomposition accuracy via Iterative Skill-Aware Decomposition (SAD), improving from 51.0% with a p-value of less than 10^-6, and reduces context window usage by over 99%.