korshunov.ai — ML news

Claude Code v2.1.181 Release Notes

Claude Code v2.1.181 adds support for changing settings from the prompt with syntax such as /config thinking=false, adds sandbox Apple Events support on macOS, and improves streaming, auto-retry, and subagent behavior. It also fixes numerous bugs related to startup, file handling, clipboard, and UI responsiveness across platforms.

lab Claude Code Releases · 10d ago

Claude Code v2.1.178 Release Notes

Claude Code v2.1.178 introduces new permission rules using Tool(param:value) syntax, improved workflow and skill loading in nested directories, and enhanced auto mode and error messaging. It fixes critical issues including crashes, authentication errors, and UI behavior in Chrome and VSCode, while refining tool prompts and undo functionality.

arxiv arXiv cs.AI · 8d ago

STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability

STARE addresses policy entropy collapse in GRPO-based reinforcement learning by identifying entropy-critical token subsets via surprisal quantiles and reweighting their advantages. It maintains stable policy entropy across model scales and tasks, outperforming DAPO and other baselines by 4%-8% on AIME24 and AIME25, with consistent exploration-exploitation balance.

arxiv arXiv cs.AI · 8d ago

OneCanvas: 3D Scene Understanding via Panoramic Reprojection

OneCanvas enables 3D scene understanding in Vision-Language Models by aggregating patch features onto a panoramic canvas using 3D world coordinates. It achieves state-of-the-art results on SQA3D and VSI-Bench, with strong generalization on SPBench, using significantly less training compute than prior methods.

arxiv arXiv cs.AI · 8d ago

Data Intelligence Agents Enable Autonomous Data Querying

Data Intelligence Agents (DIA) deploy autonomous coding agents to streamline enterprise data workflows. The Query Generator matches or exceeds top published results on seven SQL benchmarks across four dialects, showing generalization through natural-language instructions and execution-based architecture.

arxiv arXiv cs.AI · 8d ago

ScenA: Reference-Driven Multi-Speaker Audio Scene Generation

ScenA conditions a text-to-audio foundation model on multiple reference voices and a natural language scene prompt to generate realistic multi-speaker conversations. It addresses the 'Reference Shortcut' issue by using a high-noise-biased training schedule, ensuring speaker assignment relies on text prompts rather than acoustic similarity. Evaluated on CoVoMix2-Dialogue, ScenA outperforms existing systems in speaker-binding and produces rich, naturalistic audio with overlapping speech and ambient noise.

arxiv arXiv cs.AI · 8d ago

Rubric-Conditioned Self-Distillation Framework

Rubric-Conditioned Self-Distillation introduces a framework that uses structured rubrics to provide fine-grained, token-level feedback during self-distillation of reasoning language models. By conditioning teacher models on rubric-level criteria, it enables more precise credit assignment than scalar rewards, outperforming GRPO and OPSD by 1.0 and 0.9 points on average across science reasoning benchmarks.

arxiv arXiv cs.CL · 8d ago

Turing-RL: Learning User Simulators with Turing Rewards

Turing-RL introduces a reinforcement learning method using an LLM judge to evaluate how indistinguishable generated responses are from real user inputs. It outperforms baseline methods in both LLM and human evaluations across chat and Reddit forum domains, demonstrating that optimizing for indistinguishability improves user simulator performance.

arxiv arXiv cs.CL · 8d ago

OmniAgent: Native Active Perception for Omni-Modal Understanding

OmniAgent introduces a POMDP-based iterative Observation-Thought-Action cycle for video understanding, enabling on-demand action execution to selectively distill audio-visual cues into persistent textual memory. It achieves state-of-the-art performance on ten benchmarks, with a 7B agent outperforming a 10× larger Qwen2.5-VL-72B model on LVBench (50.5% vs. 47.3%).

arxiv arXiv cs.LG · 8d ago

Skill-MAS: Evolving Meta-Skill for Automatic Multi-Agent Systems

Skill-MAS introduces a new approach that decouples experience retention from parametric updates by modeling orchestration as an evolvable Meta-Skill. It uses a closed-loop process involving multi-trajectory rollouts and selective reflection to distill reusable strategy principles, achieving strong performance gains and robust transferability across tasks and LLMs.

arxiv arXiv cs.LG · 8d ago

TAPO: Self-Distillation with Micro-Reflective Trajectories

TAPO advances self-distillation by constructing explicit micro-reflective trajectories that retain erroneous reasoning and insert natural-language diagnoses. These trajectories, derived from correct and incorrect model rollouts, provide fine-grained error corrections anchored in the model's own reasoning, improving both first-pass reasoning and error correction compared to GRPO.

arxiv arXiv cs.LG · 8d ago

REVES: Augmented Training for Test-Time Scaling

REVES introduces a two-stage iterative framework that enhances LLM reasoning through sequential revision and verification. It achieves +6.5 points over RL baselines and +4.0 points over standard multi-turn training on LiveCodeBench, using a 4B base model with fewer rollouts than large evolutionary systems. The method improves error correction and generalizes to out-of-distribution puzzles like n_queens and mini_sudoku.

arxiv arXiv cs.LG · 8d ago

Unsupervised Reward Optimization for Protein Language Models

A new framework enables protein language models to generate controllable protein sequences without labeled data or wet-lab validation. It uses task-agnostic rewards based on model uncertainty and semantic consistency to guide generation, with Soft and Binarized Reward Optimization outperforming baselines in coverage and controllability across diverse conditions.

arxiv arXiv cs.LG · 8d ago

EfficientRollout: System-Aware Self-Speculative Decoding for RL Rollouts

EfficientRollout introduces a self-speculative decoding framework that reduces rollout and end-to-end latency by up to 19.6% and 12.7% respectively, without compromising final model quality. It uses a quantized drafter derived from the target model and integrates a system-aware toggle policy to avoid compute-bound regimes, enabling effective speculation during evolving policy generations.

arxiv arXiv cs.LG · 8d ago

Spotlight: Using Spot GPUs to Accelerate DiT RL Post-Training

Spotlight enables DiT RL post-training by leveraging idle spot GPUs, reducing costs by 1.4-6.4x while achieving superior image quality. It uses stale model weights in exploration and reconfigures sequence parallelism on-the-fly, allowing efficient GPU utilization without breaking training pipelines.

arxiv arXiv cs.LG · 8d ago

ViGOS: Decoupling Perception and Reasoning in Multimodal On-Policy Self-Distillation

ViGOS introduces a visually grounded on-policy self-distillation framework for multimodal large language models. It decouples perception and reasoning by using an image-only teacher for visual descriptions and a reasoning teacher for final outputs, reducing reliance on text-only references. This approach improves image-grounded performance across multiple vision-language benchmarks.

arxiv arXiv cs.CL · 8d ago

PragReST: Self-Reinforcing Counterfactual Reasoning for Pragmatic Language Understanding

PragReST is a self-supervised framework that enhances large language models' pragmatic reasoning by generating counterfactual reasoning traces and training via supervised fine-tuning and reinforcement learning. It outperforms baseline models on four pragmatic benchmarks, improving Qwen3-8B and Qwen3-14B by 5.37% and 5-5.50% accuracy respectively, and maintains strong performance on general-knowledge and mathematical reasoning tasks.

arxiv arXiv cs.CL · 8d ago

Misfired Alignment in LLMs: A Quantitative Study

A new study introduces VETO, a benchmark of 2,032 BBQ-derived contrastive pairs, to quantify misfired alignment in large language models. It defines the Misfired Alignment Rate (MAR) and finds that all benchmarked LLMs exhibit MARs between 4.7% and 18.9%, while human participants achieve 0%. The research shows alignment cues can amplify these failures, with evidence suppression occurring in late layers of models and emerging after instruction training.