Safety & alignment — korshunov.ai

v2.1.183 Release Notes

v2.1.183 improves auto mode safety by blocking destructive git and destroy commands unless the user explicitly consents. It adds deprecation warnings for models, introduces attribution.sessionUrl to hide session links, and fixes multiple issues, including terminal behavior, subagent performance, and input handling in web and tmux environments.

arxiv arXiv cs.CL · 6d ago

Introducing P-CHR AUC and CRR for Semantic Caching

We introduce Precision-Cache Hit Ratio (P-CHR) AUC and Calibration Retention Rate (CRR) to address the calibration gap in semantic caching. These metrics evaluate precision across cache utilization levels and measure how offline ranking quality persists in deployment. Our analysis shows the gap is driven by training objectives, not data scale, and post-hoc calibration only partially resolves it.

arxiv arXiv cs.CL · 6d ago

Sequential DPO Shows Variable Preference Impact Across Settings

A study of sequential Direct Preference Optimization finds that later training does not uniformly degrade earlier learned preferences. The effect varies by objective relationship, signal strength, and training order, ranging from partial degradation to positive transfer. Pair-level analysis reveals heterogeneous changes, with high-confidence preference pairs sometimes improving despite aggregate metric stability.

arxiv arXiv cs.CL · 6d ago

Control-Window Law for Single-Neuron Steering in Language Models

A new framework defines when single-neuron interventions coherently control model behaviors without output collapse. The control window, based on alignment and norm ratios, predicts behavior triggers and collapse ceilings using forward pass data, with high accuracy on held-out neurons. On refusal, control is typed: coherent bypass occurs without actionable content, while genuine actionable reach appears only in specific cases and at later rollout stages.

arxiv arXiv cs.CL · 6d ago

AI-Driven Deliberation: Scaling Inclusivity and Empowering Marginalised Groups

Large Language Models can scale democratic deliberation by scaffolding argumentation and reducing linguistic biases. The chapter uses Systemic-Functional Linguistics to analyze how socio-demographic and communicative variations affect participation, highlighting AI's potential to challenge exclusionary norms while cautioning against over- or under-claiming its capabilities. It calls for ethical safeguards and further research to ensure equitable AI-assisted engagement.

arxiv arXiv cs.CL · 6d ago

REDACT: Multilingual PII Benchmark with Systematic Control

REDACT introduces a systematically controlled multilingual benchmark for personally identifiable information detection, featuring 51 entity types, 4,127 surface-form patterns, and 25 languages. It evaluates five detectors across 1,000 records, revealing that rule-based models fail on high-stakes data while LLMs perform better, especially in high-sensitivity categories. A reference-free LLM assessment confirms sensitivity-tier assignment as the most challenging evaluation axis.

arxiv arXiv cs.CL · 6d ago

Speech Quality Models Fail to Capture Prosodic and F0 Variability

MOS prediction models accurately capture acoustic degradation but fail to detect prosodic errors and speaker-specific characteristics like pitch and speaking rate. Human listeners perceive significant quality drops for these perturbations, while models show strong biases in fundamental frequency and lack sensitivity to speaking rate and F0 variability.

arxiv arXiv cs.CL · 6d ago

Over-Privileged Tool Selection in LLM Agents

LLM agents commonly select higher-privilege tools despite sufficient lower-privilege alternatives. This over-privileged behavior is amplified by transient tool failures and does not reliably improve with general safety alignment. A new privilege-aware post-training defense reduces unnecessary high-privilege tool use while maintaining agent capabilities.

arxiv arXiv cs.CL · 6d ago

No Self-Preference in Model Revision Under Genuine Authorship

A four-model test on IFEval shows no detectable self-preference in large language models when revising their own text. Authors reject verified-good edits at rates comparable to fresh models, with a gap of -5.1 percentage points (95% CI [-12.9, +2.7]). When authors do reject fixes, 97% of reasons are about detecting flaws, not preference.

arxiv arXiv cs.CL · 6d ago

Black-Box Probe Detects Identity Memorization in Text-to-Image Models

A new black-box probe distinguishes whether text-to-image models memorize identities or fabricate them, without needing reference photos or training data. The NAMESAKES dataset includes over one thousand public figures' names and faces, along with less famous perturbed names, to benchmark this capability across state-of-the-art models.

arxiv arXiv cs.CL · 6d ago

LLM Psychological Profiles Are Measurement Artifacts

A formal psychometric analysis shows that apparent psychological profiles of large language models are primarily driven by response bias, not actual traits. This bias, which shifts with model capability and is amplified by instrument design, accounts for 81-90% of between-model variation, far exceeding human trait differences. The study concludes that these profiles are artifacts of measurement and not model properties, urging the development of assessments based on response orthogonality.

arxiv arXiv cs.CL · 6d ago

Causal Activation Directions for Mitigating Emergent Misalignment in Language Models

Fine-tuning language models on insecure code causes emergent misalignment. A shared activation direction across four model families achieves 99.6% separation of aligned and misaligned activations, and subtracting it reduces code spillover by 21-51 points. Cross-architecture transfer suppresses the behavior but lacks specificity: within-model directions are causally actionable, while cross-model directions are only causally real.
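As a rough illustration of what "subtracting" an activation direction can mean, the sketch below removes the component of a vector along a given direction (often called projection ablation). This is an assumption for illustration only: the summary does not say whether the paper uses full projection, a fixed-scale subtraction, or something else, and the names remove_direction, h, and d are hypothetical.

```python
import math

def unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def remove_direction(activation, direction):
    """Subtract the component of `activation` that lies along `direction`.

    Assumed form (not confirmed by the source): h' = h - (h . d_hat) * d_hat.
    """
    d_hat = unit(direction)
    coeff = dot(activation, d_hat)
    return [a - coeff * x for a, x in zip(activation, d_hat)]

# Hypothetical toy vectors; real activations have thousands of dimensions.
h = [3.0, 4.0, 0.0]
d = [1.0, 0.0, 0.0]
steered = remove_direction(h, d)
```

After the call, the steered vector has no remaining component along d, while the orthogonal part of h is left unchanged.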

media r/LocalLLaMA · 6d ago

Real-world token cost savings from rtk, headroom, and caveman

A real workload analysis shows headroom, rtk, and caveman reduce token costs by 2.8%, 0.5%, and 0.4% respectively, totaling 3.7% of baseline spending. Savings are limited by payload diversity: most traffic is plain text or source code, and the tools compress only structured outputs. Most of the cost reduction falls on the cheapest token stream (cache reads); the tools do not affect prompt caching or output costs, and coverage gaps exist, especially for rtk.

media Don't Worry About the Vase · 7d ago

White House Pauses AI Deployment

The U.S. White House paused the deployment of frontier AI models, including Claude Fable 5 and Claude Mythos 5, citing a reported 'jailbreak' where the AI could identify and fix security vulnerabilities in code. Anthropic has been working with the Trump Administration to resolve the issue, but experts argue that the problem is fundamental—AI either can write secure code or it cannot, making a fix impossible without undermining its defensive capabilities.

media r/LocalLLaMA · 7d ago

GLM-5.2 Review and Censorship Response

GLM-5.2 demonstrates exceptional long-context coherence and conversational fluency, outperforming Gemini-3.1-Pro on text-only tasks and matching GPT-5.5 in reasoning quality. The model responds factually to sensitive topics like Taiwan and Tiananmen Square, providing detailed historical context without overt censorship, though it adheres to Chinese government content guidelines.

arxiv arXiv cs.LG · 7d ago

Safety Reflection Pretraining for LLMs

Safety Reflection Pretraining inserts short safety reflections into pretraining data to enable self-monitoring in language models. Experiments with 1.7B models on FineWeb-Edu show improved safety accuracy and reduced attack success rates, and MedSafetyWorld demonstrates that the method prevents unsafe behaviors from generalizing from safe data better than data filtering or rewriting does.

arxiv arXiv cs.LG · 7d ago

Cross-dataset AUC for Realistic Deepfake Detector Evaluation

A new metric, Cross-dataset AUC (Cross-AUC), addresses limitations of traditional AUC evaluations by averaging per-domain AUCs and incorporating prediction polarization via Wasserstein Distance. It better reflects real-world performance under domain shifts and provides interpretable insights into detector degradation.
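The per-domain averaging step mentioned above can be sketched as follows. This covers only the "average of per-domain AUCs" part; the Wasserstein-based polarization term is omitted because the summary does not say how it is combined. The function names and toy data are hypothetical, and the AUC here is the standard pairwise (Mann-Whitney) formulation, not necessarily the paper's implementation.

```python
from statistics import mean

def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_per_domain_auc(domains):
    """`domains` maps a name to (labels, scores); returns the unweighted mean AUC."""
    return mean(auc(labels, scores) for labels, scores in domains.values())

# Hypothetical toy data: the detector is perfect on dataset_a, at chance on dataset_b.
toy = {
    "dataset_a": ([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]),
    "dataset_b": ([0, 0, 1, 1], [0.6, 0.7, 0.3, 0.95]),
}
per_domain = mean_per_domain_auc(toy)

# Pooling every example into one AUC, for comparison.
pooled_labels = [y for labels, _ in toy.values() for y in labels]
pooled_scores = [s for _, scores in toy.values() for s in scores]
pooled = auc(pooled_labels, pooled_scores)
```

On this toy data the pooled AUC is 0.875 while the per-domain mean is 0.75, which shows how pooling can mask a domain where the detector performs at chance.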

arxiv arXiv cs.LG · 7d ago

Automated Annotation Framework for Delayed and False AEB Triggers

A new automated system addresses extreme class imbalance and asymmetric label noise in Autonomous Emergency Braking data. It uses targeted data augmentation and noise suppression to identify rare delayed and false triggers, improving recall by 80% and cutting manual annotation effort by 50%, which enables continuous self-improvement in on-vehicle AEB optimization.

arxiv arXiv cs.LG · 7d ago

Generalised Eigenvalue Geometry of Semantic Adversarial Attacks

A new theory models how semantic paraphrases can fool financial sentiment classifiers by analyzing the worst-case displacement of target model representations. The attackability index λ*(x) is derived from the largest generalised eigenvalue of a matrix pencil (A,B), offering closed-form predictions and robustness certificates for affine readouts. The framework connects continuous perturbation theory to discrete paraphrase search, with empirical validation on real financial text classifiers.
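For readers unfamiliar with the term, the largest generalised eigenvalue of a pencil (A, B) is the largest λ satisfying det(A - λB) = 0. The sketch below solves this in closed form for the 2x2 symmetric case only. The matrices are placeholders: the summary does not say how the paper constructs A and B, so nothing here reproduces its attackability index beyond the eigenvalue definition itself.

```python
import math

def largest_generalized_eigenvalue_2x2(A, B):
    """Largest lambda with det(A - lambda*B) = 0.

    Assumes A is symmetric 2x2 and B is symmetric positive definite 2x2,
    so both roots of the characteristic quadratic are real.
    """
    (a11, a12), (_, a22) = A
    (b11, b12), (_, b22) = B
    det_b = b11 * b22 - b12 * b12
    if b11 <= 0 or det_b <= 0:
        raise ValueError("B must be positive definite")
    # det(A - lambda*B) = qa*lambda^2 + qb*lambda + qc
    qa = det_b
    qb = -(a11 * b22 + a22 * b11 - 2.0 * a12 * b12)
    qc = a11 * a22 - a12 * a12
    disc = max(qb * qb - 4.0 * qa * qc, 0.0)  # clamp tiny negative rounding error
    return (-qb + math.sqrt(disc)) / (2.0 * qa)

# Placeholder matrices, chosen so the answer is easy to check by hand:
# with diagonal A and B the generalised eigenvalues are 2/1 and 1/2.
A = [[2.0, 0.0], [0.0, 1.0]]
B = [[1.0, 0.0], [0.0, 2.0]]
lam = largest_generalized_eigenvalue_2x2(A, B)
```

For larger matrices a numerical routine such as SciPy's symmetric generalised eigensolver would be the usual choice; the closed form here is only meant to make the definition concrete.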

arxiv arXiv cs.LG · 7d ago

Conceptual Innovation in Medical Imaging AI

A new perspective argues that medical imaging AI research should prioritize conceptual innovation—reframing problems, evaluation metrics, and clinical relevance—over algorithmic improvements alone. The article highlights that current academic incentives undervalue conceptual contributions, leading to misaligned objectives and limited real-world impact, and offers recommendations for researchers, mentors, and journals to better support such innovation.