Safety & alignment — korshunov.ai — ML news

arXiv cs.LG · 7d ago

Safety Reflection Pretraining for LLMs

Safety Reflection Pretraining inserts short safety reflections into pretraining data to enable self-monitoring in language models. Experiments with 1.7B-parameter models on FineWeb-Edu show improved safety accuracy and reduced attack success rates. Results on MedSafetyWorld indicate that the method is more effective than data filtering or rewriting at keeping unsafe behaviors from generalizing out of safe data.

arXiv cs.LG · 7d ago

Cross-dataset AUC for Realistic Deepfake Detector Evaluation

A new metric, Cross-dataset AUC (Cross-AUC), addresses limitations of traditional AUC evaluations by averaging per-domain AUCs and incorporating prediction polarization via Wasserstein Distance. It better reflects real-world performance under domain shifts and provides interpretable insights into detector degradation.
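
As a rough illustration of the two ingredients named above, the sketch below averages per-domain AUCs and separately reports a 1-D Wasserstein distance between the fake and real score distributions as a polarization signal. The summary does not say how the paper combines the two into Cross-AUC, so the function names, the equal-sample-size restriction, and the choice to return the two quantities side by side are all assumptions made for illustration.

```python
from statistics import mean

def auc(scores, labels):
    """Rank-based AUC: chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def wasserstein_1d(a, b):
    """W1 distance between two equal-size empirical samples: mean gap between sorted values."""
    assert len(a) == len(b), "this simplified form assumes equal sample sizes"
    return mean(abs(x - y) for x, y in zip(sorted(a), sorted(b)))

def per_domain_report(domains):
    """domains maps a name to (scores, labels); returns (mean per-domain AUC, per-domain polarization)."""
    aucs, polarization = {}, {}
    for name, (scores, labels) in domains.items():
        aucs[name] = auc(scores, labels)
        pos = [s for s, y in zip(scores, labels) if y == 1]
        neg = [s for s, y in zip(scores, labels) if y == 0]
        # Hypothetical polarization signal: how far apart the two score clouds sit.
        polarization[name] = wasserstein_1d(pos, neg)
    return mean(aucs.values()), polarization
```

Averaging per domain, rather than pooling all samples before computing one AUC, keeps a large easy domain from masking a small domain where the detector degrades.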

arXiv cs.LG · 7d ago

Automated Annotation Framework for Delayed and False AEB Triggers

A new automated system addresses extreme class imbalance and asymmetric label noise in Autonomous Emergency Braking (AEB) data. It uses targeted data augmentation and noise suppression to identify rare delayed and false triggers, with an 80% improvement in recall and a 50% reduction in manual annotation effort, enabling continuous self-improvement in on-vehicle AEB optimization.

arXiv cs.LG · 7d ago

Generalised Eigenvalue Geometry of Semantic Adversarial Attacks

A new theory models how semantic paraphrases can fool financial sentiment classifiers by analyzing the worst-case displacement of target model representations. The attackability index λ*(x) is derived from the largest generalised eigenvalue of a matrix pencil (A,B), offering closed-form predictions and robustness certificates for affine readouts. The framework connects continuous perturbation theory to discrete paraphrase search, with empirical validation on real financial text classifiers.
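
To make the "largest generalised eigenvalue of a matrix pencil (A,B)" concrete, here is a minimal sketch for the 2x2 case, solving det(A - λB) = 0 directly. It assumes A is symmetric and B is symmetric positive definite, so both roots are real. What A and B represent in the paper is not stated in this summary; the matrices and the function name are purely illustrative.

```python
import math

def largest_generalised_eigenvalue_2x2(A, B):
    """Largest root of det(A - lam * B) = 0 for 2x2 matrices.

    Assumes A is symmetric and B is symmetric positive definite,
    which guarantees real roots and det(B) > 0.
    """
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    qa = b11 * b22 - b12 * b21                              # det(B): coefficient of lam^2
    qb = -(a11 * b22 + a22 * b11 - a12 * b21 - a21 * b12)   # coefficient of lam
    qc = a11 * a22 - a12 * a21                              # det(A): constant term
    disc = qb * qb - 4 * qa * qc
    root = math.sqrt(max(disc, 0.0))                        # clamp tiny negative rounding error
    return (-qb + root) / (2 * qa)
```

Under the same assumptions, this value equals the maximum over nonzero v of the ratio (vᵀAv)/(vᵀBv), which is the usual reason a largest generalised eigenvalue serves as a worst-case measure.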

arXiv cs.LG · 7d ago

Conceptual Innovation in Medical Imaging AI

A new perspective argues that medical imaging AI research should prioritize conceptual innovation—reframing problems, evaluation metrics, and clinical relevance—over algorithmic improvements alone. The article highlights that current academic incentives undervalue conceptual contributions, leading to misaligned objectives and limited real-world impact, and offers recommendations for researchers, mentors, and journals to better support such innovation.

arXiv cs.LG · 7d ago

Zero-Overhead Telemetry Detects Hidden ML Training

A study evaluates GPU workload classification using only zero-overhead NVML telemetry. The classifier achieves 98.2% accuracy in identifying training workloads and 43% to 87% accuracy against adversarially disguised, unexpected workloads across 9 GPU models.

arXiv cs.LG · 7d ago

MC Dropout Uncertainty Alignment Insufficient for Clinical Safety in Glioma Segmentation

A study on 126 BraTS21 patients finds that while MC Dropout achieves strong uncertainty-error alignment, it fails to detect critical calibration issues in enhancing tumour regions. The UNet-Res model shows near-zero entropy and high ECE in these clinically vital areas, with a low Dice score of 0.714, indicating severe miscalibration invisible to standard metrics like Dice and AUROC. These results highlight that uncertainty alignment alone is insufficient for clinical safety and that region-specific calibration must be evaluated alongside standard metrics.
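
For readers unfamiliar with ECE, the sketch below shows one common binned form of expected calibration error. It illustrates the general metric, not the paper's evaluation code; the bin count and function name are assumptions. A model that is highly confident (near-zero entropy) yet often wrong produces a large ECE, which is the pattern the summary describes for enhancing tumour regions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: size-weighted average of |accuracy - mean confidence| over confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)   # confidence 1.0 falls in the last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += len(members) / total * abs(accuracy - avg_conf)
    return ece
```

Computing this per region (for example, only over voxels in one tumour sub-region) rather than over the whole volume is what lets a region-specific failure show up.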

arXiv cs.AI · 7d ago

Taxonomy Links Caregiver Needs to Mental Health Tech

A new taxonomy connects Alzheimer's and dementia caregiver mental health needs with technology interventions. It identifies gaps in support for issues like relational strain and compassion fatigue, and offers a shared framework for designing person-centered, clinically grounded technologies.

arXiv cs.AI · 7d ago

Self-Correction Boosts Trust in Social Chatbots

A study finds that social chatbots correcting their own errors earn higher user trust and perceived expertise than those relying on external corrections. The strength of user-chatbot social connection enhances belief change only when the chatbot self-corrects, showing that social connection amplifies error correction effectiveness.

arXiv cs.LG · 7d ago

Detecting Structural Biases via Causal Mechanism Shifts

This paper introduces StruBI, an algorithm that identifies hidden confounding and selection biases by analyzing causal mechanism shifts across environments. It formalizes a mutual information-based criterion to detect structural biases and demonstrates superior performance in recovering biased variables on synthetic and real-world data.

arXiv cs.LG · 7d ago

QUAM-SM Framework for Uncertainty Quantification in Medical Image Segmentation

QUAM-SM is a post-hoc framework that uses adversarial search to identify 'adversarially fragile' pixels in medical image segmentation. It disentangles epistemic and aleatoric uncertainty and outperforms existing methods in reliability and boundary sensitivity on public datasets with expert annotations.

arXiv cs.LG · 7d ago

Scaling AEB with Unlabeled Data via Meta-Feedback SSL

A meta-feedback semi-supervised learning framework enables scaling of automatic emergency braking using massive unlabeled fleet data. The stabilized approach reduces pseudo-label errors and suppresses risk hallucinations, achieving a 100:1 positive-to-false activation ratio and 35% more accident-free driving mileage compared to a rule-only baseline in real-world deployment.

arXiv cs.LG · 7d ago

Feature Selection and Ridge Regularization in Strategic Classification

A study finds that excluding features based on manipulability alone is suboptimal in strategic classification. The research develops a joint algorithm for selecting features and tuning ridge regularization, offering a practical framework to mitigate strategic manipulation in healthcare decision systems.

arXiv cs.LG · 7d ago

Reward-Free Learning from Perceptual Streams

A new framework enables online reward-punishment learning without environment rewards, using only fixed-channel perceptual packets. It achieves high accuracy in value inference and policy optimization, with B_xi attaining 0.952 balanced reward-sign accuracy and overall policy performance reaching 0.979 optimal-action accuracy in tested tasks, outperforming controls like zero reward and shuffled targets.

arXiv cs.LG · 7d ago

Positive-Unlabeled Learning for LLM Evaluation Auditing

A new framework uses positive-unlabeled learning and Partial Optimal Transport to audit LLM evaluation biases. It aligns human-verified positive outputs with unlabelled model responses in embedding space, identifying consistent human preferences and correcting verbosity bias without retraining. Experiments show improved human alignment, robustness to presentation biases, and interpretable confidence estimates.

arXiv cs.LG · 7d ago

Wasserstein Policy Learning for Distributional Outcomes

This paper introduces offline policy learning for distribution-valued outcomes, where rewards are derived from utility functionals applied to Wasserstein barycenters. It establishes statistical guarantees for IPW and DR estimators, proving finite-sample regret bounds with leading dependence $\widetilde{\mathcal{O}}\big(\sqrt{\mathrm{N\text{-}dim}(\Pi)/N}\big)$, and provides a minimax lower bound confirming that this rate is sharp.

arXiv cs.LG · 7d ago

XAI Reveals Key Drivers in European Electricity Markets

A study using SHAP and SSHAP techniques analyzes electricity price drivers across 39 European bidding zones. It finds solar energy has a disproportionate impact on prices, gas remains a dominant factor, and interconnections highlight regional interdependence. The research also builds a synthetic EU-wide market to examine a fully integrated, single-price scenario.

arXiv cs.LG · 7d ago

Local Population-Risk Certificates for Model Updates

The paper introduces local certificates that provide two-sided confidence bands for population-risk increments around a current model. The upper endpoint of this band defines a risk-controlled update rule: updates are accepted only if the certified upper endpoint is nonpositive, otherwise the current model is retained.
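
The acceptance rule described above can be sketched in a few lines. The band itself is taken as an input here, since the summary does not say how the certificate is computed; the function and parameter names are hypothetical.

```python
def risk_controlled_update(current_model, candidate_model, band):
    """Accept the candidate only if the certified upper endpoint of the risk increment is nonpositive.

    band is a (lower, upper) confidence band on risk(candidate) - risk(current);
    how it is obtained is outside this sketch.
    """
    _lower, upper = band
    return candidate_model if upper <= 0.0 else current_model
```

Only the upper endpoint drives the decision: a band such as (-0.2, 0.05) suggests a likely improvement but still leads to retaining the current model, because a risk increase has not been ruled out.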

arXiv cs.LG · 7d ago

OpenAnt: LLM-Powered Vulnerability Discovery System

OpenAnt uses code decomposition, adversarial verification, and dynamic testing to identify vulnerabilities in large codebases. It reduces analysis surface by up to 97% and cuts false positives while validating findings through automated, sandboxed execution. Evaluated on OpenSSL, WordPress, and Flowise, it discovers previously unknown vulnerabilities with manageable cost and scalability.