Safety & alignment — korshunov.ai

LLM Psychological Profiles Are Measurement Artifacts

A formal psychometric analysis shows that apparent psychological profiles of large language models are primarily driven by response bias, not actual traits. This bias, which shifts with model capability and is amplified by instrument design, accounts for 81-90% of between-model variation, far exceeding human trait differences. The study concludes that these profiles are artifacts of measurement and not model properties, urging the development of assessments based on response orthogonality.

arxiv arXiv cs.CL · 7d ago

Causal Activation Directions for Mitigating Emergent Misalignment in Language Models

Fine-tuning language models on insecure code causes emergent misalignment. A shared activation direction across four model families achieves 99.6% separation of aligned and misaligned activations, and subtracting it reduces code spillover by 21-51 points. Cross-architecture transfer suppresses the behavior but lacks specificity: within-model directions are causally actionable, while cross-model directions are only causally real.
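As a rough illustration of what "subtracting a direction" can mean, the sketch below projects a direction out of an activation vector. This is one common reading, not necessarily the paper's exact procedure. The function name, the toy vectors, and the choice of full projection (rather than a scaled subtraction) are assumptions.

```python
import math

def ablate_direction(activation, direction):
    """Project `direction` out of `activation` and return the result."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    unit = [d / norm for d in direction]
    # Scalar component of the activation along the unit direction.
    coeff = sum(a * u for a, u in zip(activation, unit))
    return [a - coeff * u for a, u in zip(activation, unit)]

# Toy 2-d example: the hypothetical direction is the x-axis (assumption).
cleaned = ablate_direction([3.0, 4.0], [2.0, 0.0])
print(cleaned)
```

After ablation the vector has no component left along the chosen direction, while everything orthogonal to it is unchanged.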

media r/LocalLLaMA · 7d ago

Real-world token cost savings from rtk, headroom, and caveman

A real workload analysis shows headroom, rtk, and caveman reduce token costs by 2.8%, 0.5%, and 0.4% respectively, totaling 3.7% of baseline spending. Savings are limited because most traffic is plain text or source code, while the tools compress only structured outputs. Most of the cost reduction falls on the cheapest token stream (cache reads); the tools do not affect prompt caching or output costs, and coverage gaps exist, especially for rtk.

media Don't Worry About the Vase · 7d ago

White House Pauses AI Deployment

The U.S. White House paused the deployment of frontier AI models, including Claude Fable 5 and Claude Mythos 5, citing a reported 'jailbreak' where the AI could identify and fix security vulnerabilities in code. Anthropic has been working with the Trump Administration to resolve the issue, but experts argue that the problem is fundamental—AI either can write secure code or it cannot, making a fix impossible without undermining its defensive capabilities.

media r/LocalLLaMA · 7d ago

GLM-5.2 Review and Censorship Response

GLM-5.2 demonstrates exceptional long-context coherence and conversational fluency, outperforming Gemini-3.1-Pro on text-only tasks and matching GPT-5.5 in reasoning quality. The model responds factually to sensitive topics like Taiwan and Tiananmen Square, providing detailed historical context without overt censorship, though it adheres to Chinese government content guidelines.

arxiv arXiv cs.LG · 7d ago

Safety Reflection Pretraining for LLMs

Safety Reflection Pretraining inserts short safety reflections into pretraining data to enable self-monitoring in language models. Experiments with 1.7B models on FineWeb-Edu show improved safety accuracy and reduced attack success rates. Results on MedSafetyWorld show that the method prevents unsafe behaviors from generalizing from safe data better than data filtering or rewriting does.

arxiv arXiv cs.LG · 7d ago

Cross-dataset AUC for Realistic Deepfake Detector Evaluation

A new metric, Cross-dataset AUC (Cross-AUC), addresses limitations of traditional AUC evaluations by averaging per-domain AUCs and incorporating prediction polarization via Wasserstein Distance. It better reflects real-world performance under domain shifts and provides interpretable insights into detector degradation.
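The per-domain averaging step can be sketched as follows. This covers only the averaging part: the summary does not say how the Wasserstein polarization term is combined, so it is left out. The function names and toy scores are assumptions.

```python
def auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_per_domain_auc(domains):
    """domains maps a domain name to (labels, scores); returns (mean, per-domain dict)."""
    per_domain = {name: auc(y, s) for name, (y, s) in domains.items()}
    return sum(per_domain.values()) / len(per_domain), per_domain

# Two toy domains whose score ranges are shifted relative to each other.
domains = {
    "A": ([0, 0, 1, 1], [0.1, 0.2, 0.3, 0.4]),
    "B": ([0, 0, 1, 1], [0.6, 0.7, 0.8, 0.9]),
}
mean_auc, per_domain = mean_per_domain_auc(domains)
pooled_auc = auc([0, 0, 1, 1, 0, 0, 1, 1],
                 [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])
print(mean_auc, pooled_auc)
```

In this toy case each domain is perfectly separable on its own, yet pooling the scores gives a different AUC because the domains sit at different score levels. This shows only that per-domain and pooled numbers can diverge.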

arxiv arXiv cs.LG · 7d ago

Automated Annotation Framework for Delayed and False AEB Triggers

A new automated system addresses extreme class imbalance and asymmetric label noise in Autonomous Emergency Braking (AEB) data. It uses targeted data augmentation and noise suppression to identify rare delayed and false triggers, improving recall by 80% and cutting manual annotation effort by 50%, which enables continuous self-improvement in on-vehicle AEB optimization.

arxiv arXiv cs.LG · 7d ago

Generalised Eigenvalue Geometry of Semantic Adversarial Attacks

A new theory models how semantic paraphrases can fool financial sentiment classifiers by analyzing the worst-case displacement of target model representations. The attackability index λ*(x) is derived from the largest generalised eigenvalue of a matrix pencil (A,B), offering closed-form predictions and robustness certificates for affine readouts. The framework connects continuous perturbation theory to discrete paraphrase search, with empirical validation on real financial text classifiers.
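For readers unfamiliar with matrix pencils, the largest generalised eigenvalue of (A, B) is the largest λ satisfying det(A − λB) = 0. A minimal 2x2 sketch follows. It assumes A is symmetric and B is symmetric positive definite, which the summary does not state, and the function name is hypothetical.

```python
import math

def largest_generalised_eigenvalue_2x2(A, B):
    """Largest lambda with det(A - lambda * B) = 0 for symmetric 2x2 A and SPD B."""
    (a11, a12), (_, a22) = A
    (b11, b12), (_, b22) = B
    p = b11 * b22 - b12 * b12                     # det(B)
    if b11 <= 0.0 or p <= 0.0:
        raise ValueError("B must be symmetric positive definite")
    q = a11 * b22 + a22 * b11 - 2.0 * a12 * b12   # linear coefficient
    r = a11 * a22 - a12 * a12                     # det(A)
    # Characteristic polynomial: p * lam^2 - q * lam + r = 0.
    disc = max(q * q - 4.0 * p * r, 0.0)          # guard against round-off
    return (q + math.sqrt(disc)) / (2.0 * p)

A = [[2.0, 0.0], [0.0, 3.0]]
B = [[2.0, 0.0], [0.0, 1.0]]
print(largest_generalised_eigenvalue_2x2(A, B))
```

Under these assumptions the same value equals the maximum of the ratio (vᵀAv)/(vᵀBv) over non-zero v, which is why such an eigenvalue can describe a worst-case direction.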

arxiv arXiv cs.LG · 7d ago

Conceptual Innovation in Medical Imaging AI

A new perspective argues that medical imaging AI research should prioritize conceptual innovation—reframing problems, evaluation metrics, and clinical relevance—over algorithmic improvements alone. The article highlights that current academic incentives undervalue conceptual contributions, leading to misaligned objectives and limited real-world impact, and offers recommendations for researchers, mentors, and journals to better support such innovation.

arxiv arXiv cs.LG · 7d ago

Zero-Overhead Telemetry Detects Hidden ML Training

A study evaluates GPU workload classification using only zero-overhead NVML telemetry. The classifier achieves 98.2% accuracy in identifying training workloads and 43-87% accuracy against adversarially disguised, unexpected workloads across 9 GPU models.

arxiv arXiv cs.LG · 7d ago

MC Dropout Uncertainty Alignment Insufficient for Clinical Safety in Glioma Segmentation

A study on 126 BraTS21 patients finds that while MC Dropout achieves strong uncertainty-error alignment, it fails to detect critical calibration issues in enhancing tumour regions. The UNet-Res model shows near-zero entropy and high ECE in these clinically vital areas, with a low Dice score of 0.714, indicating severe miscalibration invisible to standard metrics like Dice and AUROC. These results highlight that uncertainty alignment alone is insufficient for clinical safety and that region-specific calibration must be evaluated alongside standard metrics.
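To make the "near-zero entropy and high ECE" combination concrete, here is a minimal sketch of expected calibration error with equal-width bins, plus binary entropy. The binning convention, function names, and toy numbers are assumptions, not the study's setup.

```python
import math

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean of |avg confidence - accuracy| over equal-width confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, ok))
    total = len(confidences)
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(1 for _, ok in b if ok) / len(b)
            ece += len(b) / total * abs(avg_conf - acc)
    return ece

def binary_entropy(p):
    """Entropy in nats of a Bernoulli(p) prediction."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

# Toy case: four very confident predictions, half of them wrong.
conf = [0.99, 0.99, 0.99, 0.99]
hits = [True, False, True, False]
print(expected_calibration_error(conf, hits), binary_entropy(0.99))
```

Here the per-prediction entropy is small because the model is confident, while the ECE is large because that confidence does not match the observed accuracy. A low-uncertainty region can therefore still be badly calibrated.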

arxiv arXiv cs.AI · 7d ago

Taxonomy Links Caregiver Needs to Mental Health Tech

A new taxonomy connects Alzheimer's and dementia caregiver mental health needs with technology interventions. It identifies gaps in support for issues like relational strain and compassion fatigue, and offers a shared framework for designing person-centered, clinically grounded technologies.

arxiv arXiv cs.AI · 7d ago

Self-Correction Boosts Trust in Social Chatbots

A study finds that social chatbots correcting their own errors earn higher user trust and perceived expertise than those relying on external corrections. The strength of user-chatbot social connection enhances belief change only when the chatbot self-corrects, showing that social connection amplifies error correction effectiveness.

arxiv arXiv cs.LG · 7d ago

Detecting Structural Biases via Causal Mechanism Shifts

This paper introduces StruBI, an algorithm that identifies hidden confounding and selection biases by analyzing causal mechanism shifts across environments. It formalizes a mutual information-based criterion to detect structural biases and demonstrates superior performance in recovering biased variables on synthetic and real-world data.

arxiv arXiv cs.LG · 7d ago

QUAM-SM Framework for Uncertainty Quantification in Medical Image Segmentation

QUAM-SM is a post-hoc framework that uses adversarial search to identify 'adversarially fragile' pixels in medical image segmentation. It disentangles epistemic and aleatoric uncertainty and outperforms existing methods in reliability and boundary sensitivity on public datasets with expert annotations.

arxiv arXiv cs.LG · 7d ago

Scaling AEB with Unlabeled Data via Meta-Feedback SSL

A meta-feedback semi-supervised learning framework enables scaling of automatic emergency braking using massive unlabeled fleet data. The stabilized approach reduces pseudo-label errors and suppresses risk hallucinations, achieving a 100:1 positive-to-false activation ratio and 35% more accident-free driving mileage compared to a rule-only baseline in real-world deployment.

arxiv arXiv cs.LG · 7d ago

Feature Selection and Ridge Regularization in Strategic Classification

A study finds that excluding features based on manipulability alone is suboptimal in strategic classification. The research develops a joint algorithm for selecting features and tuning ridge regularization, offering a practical framework to mitigate strategic manipulation in healthcare decision systems.

arxiv arXiv cs.LG · 7d ago

Reward-Free Learning from Perceptual Streams

A new framework enables online reward-punishment learning without environment rewards, using only fixed-channel perceptual packets. It achieves high accuracy in value inference and policy optimization, with B_xi attaining 0.952 balanced reward-sign accuracy and overall policy performance reaching 0.979 optimal-action accuracy in tested tasks, outperforming controls like zero reward and shuffled targets.