korshunov.ai — ML news

v2.1.183 Release Notes

v2.1.183 improves auto mode safety by blocking destructive git and destroy commands without explicit user consent. It adds deprecation warnings for models, introduces attribution.sessionUrl to hide session links, and fixes multiple issues including terminal behavior, subagent performance, and input handling in web and tmux environments.

github AutoGPT · 8d ago

autogpt-platform-beta-v0.6.64 Released

The autogpt-platform-beta-v0.6.64 release, dated 18 June 2026, introduces new features such as the AutoPilot Context Panel and Global Search, along with enhancements to graph saving, caching, and builder performance. It also includes security hardening, bug fixes for LLM provider issues, and UI improvements like a high-resolution touch icon.

lab Claude Code Releases · 8d ago

Claude Code v2.1.181 Release Notes

Claude Code v2.1.181 introduces support for setting config settings via prompt syntax like /config thinking=false, adds sandbox Apple Events support on macOS, and improves streaming, auto-retry, and subagent behavior. It also fixes numerous bugs related to startup, file handling, clipboard, and UI responsiveness across platforms.

lab Claude Code Releases · 10d ago

Claude Code v2.1.178 Release Notes

Claude Code v2.1.178 introduces new permission rules using Tool(param:value) syntax, improved workflow and skill loading in nested directories, and enhanced auto mode and error messaging. It fixes critical issues including crashes, authentication errors, and UI behavior in Chrome and VSCode, while refining tool prompts and undo functionality.

arxiv arXiv cs.LG · 7d ago

VIMPO: Critic-Free Policy Optimization for LLMs

VIMPO introduces a critic-free policy optimization method that derives a policy-implied value function from KL-regularized reinforcement learning. It enables verifiable reward incorporation without training a critic and outperforms GRPO on mathematical benchmarks, especially under noisy rewards.
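
For context, the textbook KL-regularized RL identity below shows how a value function can be read off a policy without a separate critic. This is a sketch of the standard setting only; the symbols (reward r, reference policy \pi_ref, temperature \beta) are assumptions, and the paper's own derivation and estimator may differ.

```latex
% KL-regularized objective for a prompt x and response y
\max_{\pi}\; \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
  - \beta\, \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)

% Closed-form optimum and its soft value function
\pi^{*}(y \mid x) = \pi_{\mathrm{ref}}(y \mid x)\,
  \exp\!\Big( \tfrac{1}{\beta} \big( r(x, y) - V^{*}(x) \big) \Big),
\qquad
V^{*}(x) = \beta \log \mathbb{E}_{y \sim \pi_{\mathrm{ref}}(\cdot \mid x)}
  \big[ \exp\big( r(x, y) / \beta \big) \big]

% Rearranging: the value implied by the policy for any sampled y
V^{*}(x) = r(x, y) - \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
```

The last line is the sense in which a value is "policy-implied": given a reward and the log-ratio to the reference policy, no learned critic is needed to evaluate it.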

arxiv arXiv cs.LG · 7d ago

LLM-based Hierarchical Control in Multi-Agent Games

A hierarchical system using a pretrained LLM to select RL skill policies outperforms flat RL in a 2v2 King of the Hill environment. It matches hand-crafted behavior tree performance in win rate and is perceived as more human-like by 60% of users, highlighting effective coordination and adaptability without manual rule design.

arxiv arXiv cs.LG · 7d ago

Pose6DAug: Physically Plausible Multi-view Object Swapping

Pose6DAug enables robot data augmentation by swapping objects in successful episodes while preserving physically valid 6D pose trajectories. It operates in 3D using a mesh anchored by temporally coherent poses, ensuring multi-view consistency and physical plausibility. Fine-tuning a VLA policy on this augmented data improves novel object success rates by 16.5% over state-of-the-art baselines.

arxiv arXiv cs.LG · 7d ago

LLM-Generated GPU Kernels Face Correctness Illusion

Benchmarks using fixed-shape checks miss real bugs in LLM-generated GPU kernels. A controlled corpus of 24 kernels, including 9 buggy variants with transcription errors, reveals that an op-schema-aware oracle detects all failures and passes all correct controls, with identical results across five GPU architectures.
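
To illustrate why a single fixed-shape check can pass a buggy kernel, here is a minimal plain-Python sketch. It is not the paper's oracle or corpus: `buggy_matmul`, the shape lists, and the `passes` helper are all hypothetical stand-ins, and the "kernel" is an ordinary function rather than GPU code.

```python
import random


def reference_matmul(a, b):
    """Plain-Python reference: (m x k) @ (k x n) -> (m x n)."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def buggy_matmul(a, b):
    """Hypothetical transcription error: the inner loop runs over m, not k."""
    m, n = len(a), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(m)) for j in range(n)]
            for i in range(m)]


def random_matrix(rows, cols, rng):
    """Small integer entries keep comparisons exact."""
    return [[rng.randint(-3, 3) for _ in range(cols)] for _ in range(rows)]


def passes(kernel, shapes, seed=0):
    """Return True only if the kernel matches the reference on every shape."""
    rng = random.Random(seed)
    for m, k, n in shapes:
        a, b = random_matrix(m, k, rng), random_matrix(k, n, rng)
        try:
            if kernel(a, b) != reference_matmul(a, b):
                return False
        except IndexError:
            # Out-of-bounds access counts as a failure, not a crash of the test.
            return False
    return True


# A single square shape hides the bug because m == k there.
FIXED_SHAPE = [(4, 4, 4)]
# Non-square shapes exercise the inner dimension independently of m.
VARIED_SHAPES = [(4, 4, 4), (2, 5, 3), (5, 2, 3), (1, 7, 1)]

print(passes(buggy_matmul, FIXED_SHAPE))
print(passes(buggy_matmul, VARIED_SHAPES))
```

The buggy function agrees with the reference whenever the row count equals the inner dimension, so only the varied-shape run exposes it.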

arxiv arXiv cs.LG · 7d ago

CRAX: Fast Safe Reinforcement Learning Benchmarking

CRAX introduces a high-fidelity, fast safety benchmark for reinforcement learning using MuJoCo XLA. It achieves up to 100x speedups over CPU-based benchmarks via vectorization and hardware acceleration, featuring six environment suites and three agent-specific tasks across three difficulty levels. Evaluation of six safe RL methods shows no single approach dominates, highlighting trade-offs between performance and safety, with curriculum learning and safety transfer improving results.

arxiv arXiv cs.CL · 7d ago

Benchmarking Agentic Review Systems for AI-Assisted Research

A study evaluates four AI review systems across six language models, finding that OpenAIReview with GPT-5.5 achieves 83.0% accuracy in matching paper quality to external signals and detects 71.6% of injected errors. Real user feedback shows positive sentiment, with a 1.44-to-1 vote ratio, though false positives and minor nitpicks remain common.

arxiv arXiv cs.CL · 7d ago

AgentFinVQA: Auditable, On-Premise Financial Chart QA

AgentFinVQA introduces a multi-agent pipeline for financial chart question answering that ensures auditability and on-premise deployability without significant accuracy loss. It outperforms baseline models by +7.68 pp using a proprietary backbone and +4.84 pp with open-weights Qwen3.6-27B-FP8, while providing a confidence signal via verifier output that improves human review routing.

arxiv arXiv cs.CL · 7d ago

Selective Verification for Budget-Aware Reasoning

Sevra, a serving-layer controller, selectively verifies answers to improve accuracy and reduce token usage. On MATH-500, it achieves 76.3% accuracy with 26.8% fewer post-generation tokens and half as many harmful flips, while on GSM8K it verifies only 3.0% of examples, boosting accuracy to 94.5% and cutting verification tokens by 91.2%. The study shows that initial solve length and explicit control needs determine optimal verification strategy.
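
The general idea of verifying only some answers can be sketched as a confidence gate. This is an illustrative assumption, not Sevra's actual policy: the `Draft` type, the `confidence` score, the `threshold`, and the `verify` callback are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Draft:
    answer: str
    confidence: float  # hypothetical score in [0, 1]
    tokens: int        # tokens spent producing the draft


def selective_verify(
    draft: Draft,
    verify: Callable[[str], Tuple[str, int]],
    threshold: float = 0.8,
) -> Tuple[str, int, bool]:
    """Return (final_answer, total_tokens, was_verified).

    The verifier runs only when confidence is below the threshold, so
    confident drafts cost no extra tokens and cannot be flipped.
    """
    if draft.confidence >= threshold:
        return draft.answer, draft.tokens, False
    revised, extra_tokens = verify(draft.answer)
    return revised, draft.tokens + extra_tokens, True
```

Skipping verification for confident drafts saves tokens and also removes the chance that a verifier overturns an answer that was already correct.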

arxiv arXiv cs.CL · 7d ago

JAMER: Project-Level Code Framework Dataset and Benchmark

JAMER introduces JamSet and JamBench, the first project-level game code dataset and benchmark on a professional game engine. Built from 8,133 verified Game Jam projects, it enables deterministic evaluation and reveals a capability cliff in AI models as project scale increases, with runtime pass rates dropping from 80.4% to 5.7%.

arxiv arXiv cs.CL · 7d ago

Control-Window Law for Single-Neuron Steering in Language Models

A new framework defines when single-neuron interventions coherently control model behaviors without output collapse. The control window, based on alignment and norm ratios, predicts behavior triggers and collapse ceilings using forward pass data, with high accuracy on held-out neurons. On refusal, control is typed: coherent bypass occurs without actionable content, while genuine actionable reach appears only in specific cases and at later rollout stages.

arxiv arXiv cs.CL · 7d ago

AtomMem: Simple and Effective Memory System for LLM Agents

AtomMem introduces a memory system that stores high-value atomic facts from long-form interactions. It uses hierarchical event structures and temporal profiles to capture coherent episodic contexts and track evolving user attributes, enabling stable and efficient memory evolution. Experiments on the LoCoMo benchmark show AtomMem achieves state-of-the-art performance in reasoning tasks.

arxiv arXiv cs.CL · 7d ago

REDACT: Multilingual PII Benchmark with Systematic Control

REDACT introduces a systematically controlled multilingual benchmark for personally identifiable information detection, featuring 51 entity types, 4,127 surface-form patterns, and 25 languages. It evaluates five detectors across 1,000 records, revealing that rule-based models fail on high-stakes data while LLMs perform better, especially in high-sensitivity categories. A reference-free LLM assessment confirms sensitivity-tier assignment as the most challenging evaluation axis.

arxiv arXiv cs.CL · 7d ago

GEMS: Geometric Constraints Enable Multi-Semantic Superposition in LLMs

GEMS enables training-free superposition of multiple semantic directions in LLMs by addressing distributional deviation and directional interference through geometric constraints. On GSM8K, it maintains 98% accuracy with three non-mathematical directions, while unconstrained addition drops to 4%; on Wikitext-2, it increases PPL by only 2.2%.

arxiv arXiv cs.CL · 7d ago

Over-Privileged Tool Selection in LLM Agents

LLM agents commonly select higher-privilege tools despite sufficient lower-privilege alternatives. This over-privileged behavior is amplified by transient tool failures and does not reliably improve with general safety alignment. A new privilege-aware post-training defense reduces unnecessary high-privilege tool use while maintaining agent capabilities.
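
For comparison, a least-privilege choice can be written as a simple runtime rule. Note that the paper's defense is a post-training method, not a runtime filter; the sketch below only illustrates the target behavior, and the `Tool` type, the integer `privilege` scale, and the capability names are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional, Set


@dataclass(frozen=True)
class Tool:
    name: str
    privilege: int                # hypothetical ordinal: higher means broader access
    capabilities: FrozenSet[str]  # what the tool is able to do


def least_privilege_tool(tools: Iterable[Tool], required: Set[str]) -> Optional[Tool]:
    """Pick the lowest-privilege tool whose capabilities cover the request."""
    candidates = [t for t in tools if required <= t.capabilities]
    if not candidates:
        return None
    return min(candidates, key=lambda t: t.privilege)


TOOLS = [
    Tool("shell", 3, frozenset({"read", "write", "exec"})),
    Tool("read_file", 1, frozenset({"read"})),
]

choice = least_privilege_tool(TOOLS, {"read"})
print(choice.name if choice else None)
```

An over-privileged agent in this toy setup would be one that picks `shell` for a read-only request even though `read_file` suffices.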

arxiv arXiv cs.CL · 7d ago

STAGE: Source-Grounded Data Generation for Text-to-JSON

STAGE is a pipeline that generates text-to-JSON training data by using LLMs to synthesize reports and JSON schemas, validated against underlying spreadsheets. Evaluations on STAGE-Eval show it improves Qwen3-4B exact match from 31.37% to 74.27% and value accuracy from 45.46% to 90.69%.

arxiv arXiv cs.CL · 7d ago

HydraHead: Head-Level Hybrid Attention for Long-Context Performance

HydraHead introduces a head-level hybridization of Full and Linear Attention, leveraging interpretability to select retrieval-critical heads and fuse outputs via a scale-normalized module. Trained on 15B tokens, it achieves over 69% improvement over baseline at 512K context length, outperforming layer-wise hybrids and approaching Qwen3.5's performance on long-context tasks.