AI agents — korshunov.ai


Coding Benchmarks Misaligned with Agentic Software Engineering

Current coding benchmarks were designed before agentic software engineering and fail to capture the complexity of real-world systems. They conflate model performance with the entire harness, ignore valid alternative solutions, and lack feedback signals at individual component levels, making iterative improvement difficult.

arxiv arXiv cs.CL · 9d ago

A Framework for Evaluating Agentic Skills at Scale

We present a framework for evaluating agentic skills by constructing realistic tasks and assessing skill utility through task execution. Applied to 500 real-world skills, it generates 1,000 tasks and scoring rubrics, evaluating 19 agent-model configurations across proprietary and open-source models. Results show significant variation in instruction adherence and performance gains, with skills substantially altering model behavior compared to no-skill setups.

arxiv arXiv cs.CL · 9d ago

Automated Prompt Optimization for LLM Game Agents

A new framework automates prompt refinement for LLM agents by splitting the observation-to-action pipeline into goal-conditioned and action selection modules. It uses an LLM-driven evolutionary loop to iteratively improve prompts based on environment feedback, achieving up to 72.5% success on PutNext where prior agents failed, without model fine-tuning.
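
The evolutionary loop described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: it evolves a single prompt rather than the two separate modules, and `evolve_prompt`, `toy_mutate`, `toy_evaluate`, and `HINTS` are hypothetical names. In a real setup the mutation step would call an LLM and the evaluation step would run episodes in the environment; here both are replaced by toy stand-ins so the example runs on its own.

```python
import random
from typing import Callable, Tuple


def evolve_prompt(
    seed_prompt: str,
    mutate: Callable[[str, random.Random], str],
    evaluate: Callable[[str], float],
    generations: int = 5,
    population_size: int = 4,
    rng_seed: int = 0,
) -> Tuple[str, float]:
    """Keep the best-scoring prompt across generations of mutated candidates."""
    rng = random.Random(rng_seed)
    best_prompt, best_score = seed_prompt, evaluate(seed_prompt)
    for _ in range(generations):
        # All candidates in a generation are variations of the current best prompt.
        candidates = [mutate(best_prompt, rng) for _ in range(population_size)]
        for candidate in candidates:
            score = evaluate(candidate)  # stand-in for environment feedback
            if score > best_score:
                best_prompt, best_score = candidate, score
    return best_prompt, best_score


# Toy stand-ins (assumptions): a real mutate() would ask an LLM for a rewrite,
# and a real evaluate() would measure task success in the environment.
HINTS = ["Plan before acting.", "Check the goal.", "Avoid repeated moves."]


def toy_mutate(prompt: str, rng: random.Random) -> str:
    return prompt + " " + rng.choice(HINTS)


def toy_evaluate(prompt: str) -> float:
    # Fraction of distinct hints the prompt contains, in [0, 1].
    return sum(hint in prompt for hint in HINTS) / len(HINTS)


if __name__ == "__main__":
    prompt, score = evolve_prompt("You are a game agent.", toy_mutate, toy_evaluate)
    print(score, prompt)
```

Because only the prompt text changes between generations, the model weights stay untouched, which is consistent with the "without model fine-tuning" claim.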

arxiv arXiv cs.CL · 9d ago

GameCraft-Bench: Evaluating End-to-End Game Generation

GameCraft-Bench introduces a benchmark with 140 Godot tasks across 15 game families to assess coding agents' ability to generate playable games. Evaluations show the best agent achieves only 41.46% success, indicating significant challenges in producing complete, interactive games with coherent gameplay and visual feedback.

media r/LocalLLaMA · 9d ago

We Open Sourced Our LLM-based QA Agent To Catch Breakages Faster

Approxima is an open-source, self-hostable QA agent that monitors user journeys and supports Claude, Gemini, and GPT out of the box. It features Explore Mode, A/B Testing, and Self-healing to adapt to product evolution, with full support for local models and community contributions.

lab Claude Code Releases · 9d ago

Claude Code v2.1.178 Release Notes

Claude Code v2.1.178 introduces new permission rules using Tool(param:value) syntax, improved workflow and skill loading in nested directories, and enhanced auto mode and error messaging. It fixes critical issues including crashes, authentication errors, and UI behavior in Chrome and VS Code, while refining tool prompts and undo functionality.

media r/LocalLLaMA · 9d ago

Community model build thread: Crowdsourced training feasible

A community model can be built through crowdsourced compute using a 'Branch-Train-Stitch' approach. Participants train a prototype model on their hardware, submit narrow-domain submodels, and organizers stitch them into a large Mixture-of-Experts (MoE) model, with key decisions including prototype size, scope definitions, and training protocols.

media r/LocalLLaMA · 9d ago

Is DiffusionGemma really that good in a PI agent?

A Reddit post asks whether DiffusionGemma performs exceptionally well in a PI agent. The post links to an image and points readers to the comments section for further discussion.

media r/LocalLLaMA · 9d ago

Qwen Robot Suite Announced

Aliyun has launched the Qwen Robot Suite, a new set of AI-powered robotic tools. The suite aims to enable developers to build and deploy intelligent robots with enhanced capabilities.

media Interconnects · 9d ago

Frontier Post-Training Recipe Review with Finbarr Timbers

The podcast reviews the evolution of post-training recipes in large language models, from InstructGPT to 2026 frontier models. It highlights Multi-Teacher On-Policy Distillation (MOPD) as the dominant pattern, where domain-specialist models are trained and then distilled into a general student model via on-policy distillation, scaling to over 10 teachers in models like DeepSeek V4 and Nemotron 3 Ultra.

media r/LocalLLaMA · 9d ago

Why DiffusionGemma Might Excel at Tool Calls Despite Lower Base Quality

DiffusionGemma uses bidirectional attention to allow self-correction during token generation, enabling it to revise earlier tokens in a 256-token block. This capability gives it a structural advantage in generating valid tool calls, as it can correct malformed outputs that autoregressive models cannot fix once committed.

media r/LocalLLaMA · 9d ago

Learning Context and Harness Engineering for Local-First AI

A user seeks guidance on learning context and harness engineering for building local-first AI applications with specialized use cases. They express interest in avoiding general-purpose AI models like Hermes or OpenClaw and ask where to find resources, given their background in MCP servers and tool calling.

media r/LocalLLaMA · 9d ago

Be wary of Qwen/Claude distillations - they're often worse than the base model

Qwen/Claude distillations, such as a Qwen 3.6 model distilled with only 4,000 samples, rarely improve performance and often degrade quality. These models may exhibit a more 'Opus-like' style but fail to transfer actual capability. In testing and user reports, some show hallucinations and slower response times compared to the base models.

media r/LocalLLaMA · 10d ago

Are small local models for automation a thing?

A Reddit user argues that small, efficient local LLMs (1B to 4B parameters) embedded in scripts can enable practical automation of repetitive tasks. They note this use case is underrepresented in discussions focused on coding assistants or hardware performance, suggesting a gap in community interest or visibility for task-specific, lightweight AI models.

media r/LocalLLaMA · 10d ago

DGX Spark is being defamed

The post argues that the DGX Spark is being unfairly criticized despite its strong scalability and usable local AI performance. It claims that ConnectX networking allows lossless expansion and that, at 240W, the system can run agentic DS4Flash locally for around $9k with 256GB of CUDA memory.

arxiv arXiv cs.CL · 10d ago

LOGOS: A General-Purpose Generative Model for Natural Sciences

LOGOS is a unified generative language model that represents scientific objects and their interactions as token sequences in a shared grammar. It achieves consistent or superior performance across diverse natural science tasks, demonstrating the feasibility of a single model serving multiple domains. The model scales positively with parameter count, and its design suggests that AI for Science should align deeply with large language models through shared architectures and training.

arxiv arXiv cs.CL · 10d ago

IMPACTeen Dataset Released with English and Polish Versions

IMPACTeen is a dataset of 1,021 texts annotated from five perspectives—teenagers, parents, psychologists, communication experts, and teachers. It includes 5,100 annotation records covering social influence techniques, intentions, consequences, and resistance, with annotations validated through human editing. The dataset, created using LLM generation and human validation, is available in both Polish and English and supports research on social influence and language model training.

arxiv arXiv cs.CL · 10d ago

TokenPilot: Cache-Efficient Context Management for LLM Agents

TokenPilot reduces inference costs by 61% to 87% in both isolated and continuous modes, outperforming prior systems in cost efficiency while maintaining competitive performance. It uses ingestion-aware compaction and lifecycle-aware eviction to preserve prompt cache continuity and minimize token footprints.

arxiv arXiv cs.CL · 10d ago

DeepRubric: Efficient RL for Deep Research Agents

DeepRubric introduces a data construction framework that builds query-rubric pairs by first defining verifiable evaluation targets through an evidence tree. It generates 9K supervision examples and trains an 8B model with GRPO, achieving performance comparable to state-of-the-art models using 13x fewer RL GPU-hours.
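
For readers unfamiliar with GRPO, the sketch below shows the group-relative advantage step that the method is known for: several responses to the same query are scored, and each reward is normalized against the group's mean and standard deviation. How DeepRubric turns a rubric into a reward is not stated in the summary, so `rubric_score` (the fraction of rubric checks passed) is purely an assumption for illustration, as is the use of the population standard deviation.

```python
from statistics import mean, pstdev
from typing import List


def rubric_score(checks: List[bool]) -> float:
    """Hypothetical reward: fraction of rubric items a response satisfies."""
    return sum(checks) / len(checks)


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against its sampling group, as in GRPO."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    # eps keeps the division finite when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled responses to one query, each checked against three rubric items.
    group = [
        [True, True, False],
        [True, False, False],
        [True, True, True],
        [False, False, False],
    ]
    rewards = [rubric_score(checks) for checks in group]
    print(group_relative_advantages(rewards))
```

Responses that beat the group average get a positive advantage and the rest a negative one, so no separate value model is needed to provide a baseline.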

arxiv arXiv cs.CL · 10d ago

KVEraser: Efficient Localized Context Erasing in LLMs

KVEraser enables efficient localized context erasing in large language models by replacing only the KV cache states of an erased span with learned steering states. It achieves near-full-recomputation performance on in-domain tasks across 1K to 32K context lengths, with only a 24% latency increase, and outperforms other approximate methods in long-document QA with a 3-4x speedup over full recomputation.