Evaluation & benchmarks — korshunov.ai

Topic · Evaluation & benchmarks

Claude Code v2.1.181 Release Notes

Claude Code v2.1.181 adds support for changing settings through prompt syntax such as /config thinking=false, adds sandbox Apple Events support on macOS, and improves streaming, auto-retry, and subagent behavior. It also fixes numerous bugs in startup, file handling, clipboard, and UI responsiveness across platforms.

arxiv arXiv cs.CL · 6d ago

HydraHead: Head-Level Hybrid Attention for Long-Context Performance

HydraHead introduces a head-level hybridization of Full and Linear Attention, leveraging interpretability to select retrieval-critical heads and fuse outputs via a scale-normalized module. Trained on 15B tokens, it achieves over 69% improvement over baseline at 512K context length, outperforming layer-wise hybrids and approaching Qwen3.5's performance on long-context tasks.

arxiv arXiv cs.CL · 6d ago

Causal Activation Directions for Mitigating Emergent Misalignment in Language Models

Fine-tuning language models on insecure code causes emergent misalignment. A shared activation direction across four model families achieves 99.6% separation of aligned and misaligned activations, and subtracting it reduces code spillover by 21-51 points. Cross-architecture transfer suppresses the behavior but lacks specificity: within-model directions are causally actionable, whereas cross-model directions are only causally real.
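To make "subtracting an activation direction" concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's procedure: the summary does not say how the direction is found, so the difference-of-means estimate and the names `mean_difference_direction`, `subtract_direction`, and `strength` are hypothetical.

```python
import math


def unit(vector):
    """Return the vector scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in vector]


def mean_difference_direction(group_a, group_b):
    """Assumed estimator: unit vector pointing from mean(group_b) to mean(group_a)."""
    dim = len(group_a[0])
    mean_a = [sum(v[i] for v in group_a) / len(group_a) for i in range(dim)]
    mean_b = [sum(v[i] for v in group_b) / len(group_b) for i in range(dim)]
    return unit([a - b for a, b in zip(mean_a, mean_b)])


def subtract_direction(activation, direction, strength=1.0):
    """Remove `strength` times the component of `activation` along `direction`."""
    d = unit(direction)
    coeff = sum(a * x for a, x in zip(activation, d))
    return [a - strength * coeff * x for a, x in zip(activation, d)]
```

With `strength=1.0` the result has no remaining component along the direction; smaller values remove only part of it.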

media r/LocalLLaMA · 6d ago

GLM-5.2 Outperforms GPT-5.5 in AA-Briefcase Evaluation

Artificial Analysis' new agentic knowledge work evaluation, AA-Briefcase, shows GLM-5.2 surpassing GPT-5.5 in performance. The benchmark assesses real-world task execution and reasoning capabilities in knowledge work scenarios.

arxiv arXiv cs.LG · 7d ago

Discriminator-Guided RL Corrects Flow Matching with Data-Aligned Rewards

Discriminator-Guided RL (DRL) uses a pretrained representation space to train a discriminator that separates real data from model-generated samples. Its logit is used as a reward in KL-regularized RL, aligning model outputs with visual and semantic realism without human preferences. DRL improves FID and semantic FD across models like SiT and JiT, and enhances the Pareto frontier between preference and fidelity.

arxiv arXiv cs.LG · 7d ago

MAST Enables Selective Unlearning in RLVR-Induced Reasoning

MAST, a mechanism-guided unlearning method, achieves targeted forgetting of RLVR-induced reasoning with minimal collateral damage. On Qwen2.5-Math-1.5B and Qwen3-1.7B-Base, it significantly reduces MATH performance (45/150 to 37/150) while GSM8K accuracy shifts by +0.8 points and MATH retention by -0.5 points. Results hold across different seeds, objectives, and models, showing superior stability over full-parameter unlearning.

arxiv arXiv cs.LG · 7d ago

STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability

STARE addresses policy entropy collapse in GRPO-based reinforcement learning by identifying entropy-critical token subsets via surprisal quantiles and reweighting their advantages. It maintains stable policy entropy across model scales and tasks, outperforming DAPO and other baselines by 4%-8% on AIME24 and AIME25, with consistent exploration-exploitation balance.
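The idea of selecting tokens by surprisal quantile and reweighting their advantages can be sketched as follows. This is a toy illustration, not STARE itself: the summary gives neither the quantile, the weights, nor the direction of the reweighting, so `q`, `weight`, and the rule "scale tokens above the quantile" are assumptions chosen only to show the mechanics.

```python
import math


def surprisal(token_probs):
    """Per-token surprisal -log p; probabilities must lie in (0, 1]."""
    return [-math.log(p) for p in token_probs]


def quantile(values, q):
    """Quantile with linear interpolation between order statistics, 0 <= q <= 1."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return ordered[lo] * (1 - frac) + ordered[hi] * frac


def reweight_advantages(advantages, token_probs, q=0.8, weight=0.5):
    """Scale the advantage of tokens whose surprisal exceeds the q-quantile.

    Hypothetical rule: tokens above the threshold are multiplied by `weight`,
    all other tokens keep their original advantage.
    """
    s = surprisal(token_probs)
    threshold = quantile(s, q)
    return [a * weight if si > threshold else a for a, si in zip(advantages, s)]
```

For example, with token probabilities `[0.9, 0.5, 0.1, 0.8]` and `q=0.75`, only the third token (probability 0.1) exceeds the surprisal threshold and has its advantage scaled.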

arxiv arXiv cs.LG · 7d ago

TxBench-PP: AI Agent Performance in Preclinical Pharmacology

TxBench-PP is a verifiable benchmark for small-molecule preclinical pharmacology, testing AI agents' ability to derive accurate conclusions from real-world assay data. Across 16 model-harness configurations, no system reliably made correct preclinical pharmacology decisions, with the best performance at 59.3% (Claude Opus 4.8 / Pi) and 55.3% (GPT-5.5 / Pi) of endpoint attempts.

arxiv arXiv cs.LG · 7d ago

OneCanvas: 3D Scene Understanding via Panoramic Reprojection

OneCanvas enables 3D scene understanding in Vision-Language Models by aggregating patch features onto a single panoramic canvas using 3D world coordinates. It achieves state-of-the-art performance on SQA3D and VSI-Bench, and generalizes to out-of-distribution data on SPBench, using significantly less training compute than existing methods.

arxiv arXiv cs.LG · 7d ago

Zero-Overhead Telemetry Detects Hidden ML Training

A study evaluates GPU workload classification using only zero-overhead NVML telemetry. The classifier achieves 98.2% accuracy in identifying training workloads and 43-87% accuracy against adversarially disguised, unexpected workloads across 9 GPU models.

arxiv arXiv cs.AI · 7d ago

Data Intelligence Agents Enable Autonomous Data Querying

Data Intelligence Agents (DIA) deploy autonomous coding agents to streamline enterprise data workflows. The Query Generator matches or exceeds top published results on seven SQL benchmarks across four dialects, showing generalization through natural-language instructions and execution-based architecture.

arxiv arXiv cs.AI · 7d ago

ScenA: Reference-Driven Multi-Speaker Audio Scene Generation

ScenA conditions a text-to-audio foundation model on multiple reference voices and a natural language scene prompt to generate realistic multi-speaker conversations. It addresses the 'Reference Shortcut' issue by using a high-noise-biased training schedule, ensuring speaker assignment relies on text prompts rather than acoustic similarity. Evaluated on CoVoMix2-Dialogue, ScenA outperforms existing systems in speaker-binding and produces rich, naturalistic audio with overlapping speech and ambient noise.

arxiv arXiv cs.AI · 7d ago

Rubric-Conditioned Self-Distillation Framework

Rubric-Conditioned Self-Distillation introduces a framework that uses structured rubrics to provide fine-grained, token-level feedback during self-distillation of reasoning language models. By conditioning teacher models on rubric-level criteria, it enables more precise credit assignment than scalar rewards, outperforming GRPO and OPSD by 1.0 and 0.9 points on average across science reasoning benchmarks.

arxiv arXiv cs.CL · 7d ago

Turing-RL: Learning User Simulators with Turing Rewards

Turing-RL introduces a reinforcement learning method using an LLM judge to evaluate how indistinguishable generated responses are from real user inputs. It outperforms baseline methods in both LLM and human evaluations across chat and Reddit forum domains, demonstrating that optimizing for indistinguishability improves user simulator performance.

arxiv arXiv cs.LG · 7d ago

TAPO: Self-Distillation with Micro-Reflective Trajectories

TAPO advances self-distillation by constructing explicit micro-reflective trajectories that retain erroneous reasoning and insert natural-language diagnoses. These trajectories, derived from correct and incorrect model rollouts, provide fine-grained error corrections anchored in the model's own reasoning, improving both first-pass reasoning and error correction compared to GRPO.

arxiv arXiv cs.LG · 7d ago

Unsupervised Reward Optimization for Protein Language Models

A new framework enables protein language models to generate controllable protein sequences without labeled data or wet-lab validation. It uses task-agnostic rewards based on model uncertainty and semantic consistency to guide generation, with Soft and Binarized Reward Optimization outperforming baselines in coverage and controllability across diverse conditions.
