Safety & alignment — korshunov.ai

Zero-Overhead Telemetry Detects Hidden ML Training

A study evaluates GPU workload classification using only zero-overhead NVML telemetry. The classifier achieves 98.2% accuracy in identifying training workloads and 43-87% accuracy against adversarially disguised, unexpected workloads across 9 GPU models.

arXiv cs.LG · 7d ago

MC Dropout Uncertainty Alignment Insufficient for Clinical Safety in Glioma Segmentation

A study of 126 BraTS21 patients finds that although MC Dropout achieves strong uncertainty-error alignment, it fails to flag critical calibration failures in enhancing tumour regions. In these clinically vital areas the UNet-Res model shows near-zero entropy and high expected calibration error (ECE), with a Dice score of only 0.714, indicating severe miscalibration that standard metrics like Dice and AUROC do not reveal. The results suggest that uncertainty alignment alone is insufficient for clinical safety and that region-specific calibration must be evaluated alongside standard metrics.

arXiv cs.AI · 7d ago
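To make the idea of region-specific calibration concrete, here is a minimal sketch that computes ECE over a whole prediction map and then over a single region only. It is not the paper's evaluation code: the function names, the equal-width 10-bin ECE, and the boolean `region_mask` (standing in for something like an enhancing-tumour mask) are all assumptions made for illustration.

```python
import numpy as np

def expected_calibration_error(confidence, correct, n_bins=10):
    """Equal-width-bin ECE over a flat set of predictions (hypothetical helper)."""
    confidence = np.asarray(confidence, dtype=float).ravel()
    correct = np.asarray(correct, dtype=float).ravel()
    if confidence.size == 0:
        return float("nan")
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges only, so bin ids run from 0 to n_bins - 1
    # and a confidence of exactly 1.0 lands in the last bin.
    bin_ids = np.digitize(confidence, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidence[in_bin].mean())
            ece += in_bin.mean() * gap  # weight the gap by the bin's share of samples
    return float(ece)

def region_ece(confidence, pred, target, region_mask, n_bins=10):
    """ECE restricted to the voxels selected by a boolean mask (assumed input)."""
    mask = np.asarray(region_mask, dtype=bool)
    correct = np.asarray(pred) == np.asarray(target)
    return expected_calibration_error(
        np.asarray(confidence)[mask], correct[mask], n_bins
    )
```

Comparing `expected_calibration_error` on all voxels with `region_ece` on one region shows how a confidently wrong region can be diluted in the overall figure.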

Safety Reflection Pretraining for LLMs

Safety Reflection Pretraining inserts short safety reflections into pretraining data to enable self-monitoring in language models. Experiments with 1.7B-parameter models on FineWeb-Edu show improved safety accuracy and reduced attack success rates. On MedSafetyWorld, the method prevents unsafe behaviors from generalizing from safe data more effectively than data filtering or rewriting.

arXiv cs.AI · 7d ago

Taxonomy Links Caregiver Needs to Mental Health Tech

A new taxonomy connects Alzheimer's and dementia caregiver mental health needs with technology interventions. It identifies gaps in support for issues like relational strain and compassion fatigue, and offers a shared framework for designing person-centered, clinically grounded technologies.

arXiv cs.AI · 7d ago

Self-Correction Boosts Trust in Social Chatbots

A study finds that social chatbots correcting their own errors earn higher user trust and perceived expertise than those relying on external corrections. The strength of user-chatbot social connection enhances belief change only when the chatbot self-corrects, showing that social connection amplifies error correction effectiveness.

arXiv cs.LG · 7d ago

Detecting Structural Biases via Causal Mechanism Shifts

This paper introduces StruBI, an algorithm that identifies hidden confounding and selection biases by analyzing causal mechanism shifts across environments. It formalizes a mutual information-based criterion to detect structural biases and demonstrates superior performance in recovering biased variables on synthetic and real-world data.

arXiv cs.LG · 7d ago

QUAM-SM Framework for Uncertainty Quantification in Medical Image Segmentation

QUAM-SM is a post-hoc framework that uses adversarial search to identify 'adversarially fragile' pixels in medical image segmentation. It disentangles epistemic and aleatoric uncertainty and outperforms existing methods in reliability and boundary sensitivity on public datasets with expert annotations.

arXiv cs.LG · 7d ago

Scaling AEB with Unlabeled Data via Meta-Feedback SSL

A meta-feedback semi-supervised learning framework enables scaling of automatic emergency braking using massive unlabeled fleet data. The stabilized approach reduces pseudo-label errors and suppresses risk hallucinations, achieving a 100:1 positive-to-false activation ratio and 35% more accident-free driving mileage compared to a rule-only baseline in real-world deployment.

arXiv cs.LG · 7d ago

Feature Selection and Ridge Regularization in Strategic Classification

A study finds that excluding features based on manipulability alone is suboptimal in strategic classification. The research develops a joint algorithm for selecting features and tuning ridge regularization, offering a practical framework to mitigate strategic manipulation in healthcare decision systems.

arXiv cs.LG · 7d ago

Reward-Free Learning from Perceptual Streams

A new framework enables online reward-punishment learning without environment rewards, using only fixed-channel perceptual packets. It achieves high accuracy in value inference and policy optimization, with B_xi attaining 0.952 balanced reward-sign accuracy and overall policy performance reaching 0.979 optimal-action accuracy in tested tasks, outperforming controls like zero reward and shuffled targets.

arXiv cs.LG · 7d ago

Positive-Unlabeled Learning for LLM Evaluation Auditing

A new framework uses positive-unlabeled learning and Partial Optimal Transport to audit LLM evaluation biases. It aligns human-verified positive outputs with unlabelled model responses in embedding space, identifying consistent human preferences and correcting verbosity bias without retraining. Experiments show improved human alignment, robustness to presentation biases, and interpretable confidence estimates.

arXiv cs.LG · 7d ago

Wasserstein Policy Learning for Distributional Outcomes

This paper introduces offline policy learning for distribution-valued outcomes, where rewards are derived from utility functionals applied to Wasserstein barycenters. It establishes statistical guarantees for inverse propensity weighting (IPW) and doubly robust (DR) estimators, proving finite-sample regret bounds with leading dependence $\widetilde{\mathcal{O}}(\sqrt{\mathrm{N\text{-}dim}(\Pi)/N})$, and provides a minimax lower bound confirming the sharpness of this rate.

arXiv cs.LG · 7d ago
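For readers unfamiliar with the two estimators named above, the sketch below shows the textbook IPW and DR estimates of a policy's value from logged data. It is a deliberate simplification: rewards here are plain scalars, whereas the paper works with distribution-valued outcomes, and the names `ipw_value`, `dr_value`, and `q_hat` are assumptions chosen for illustration.

```python
import numpy as np

def ipw_value(policy_actions, logged_actions, rewards, propensities):
    """Inverse-propensity-weighted value estimate for a deterministic policy.

    propensities[i] is the logging policy's probability of the action it
    actually took for unit i (assumed known and strictly positive).
    """
    policy_actions = np.asarray(policy_actions)
    logged_actions = np.asarray(logged_actions)
    match = (policy_actions == logged_actions).astype(float)
    weights = match / np.asarray(propensities, dtype=float)
    return float(np.mean(weights * np.asarray(rewards, dtype=float)))

def dr_value(policy_actions, logged_actions, rewards, propensities, q_hat):
    """Doubly robust value estimate.

    q_hat[i, a] is an estimated mean reward of action a for unit i; the IPW
    term only corrects the residual that this outcome model leaves behind.
    """
    policy_actions = np.asarray(policy_actions)
    logged_actions = np.asarray(logged_actions)
    rewards = np.asarray(rewards, dtype=float)
    q_hat = np.asarray(q_hat, dtype=float)
    idx = np.arange(rewards.size)
    match = (policy_actions == logged_actions).astype(float)
    direct = q_hat[idx, policy_actions]
    residual = rewards - q_hat[idx, logged_actions]
    correction = match * residual / np.asarray(propensities, dtype=float)
    return float(np.mean(direct + correction))
```

The DR estimate stays consistent if either the propensities or the outcome model `q_hat` is correct, which is why it is usually paired with IPW in analyses of this kind.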

XAI Reveals Key Drivers in European Electricity Markets

A study using SHAP and SSHAP techniques analyzes electricity price drivers across 39 European bidding zones. It finds solar energy has a disproportionate impact on prices, gas remains a dominant factor, and interconnections highlight regional interdependence. The research also builds a synthetic EU-wide market to examine a fully integrated, single-price scenario.

arXiv cs.LG · 7d ago

Local Population-Risk Certificates for Model Updates

The paper introduces local certificates that provide two-sided confidence bands for population-risk increments around a current model. The upper endpoint of this band defines a risk-controlled update rule: updates are accepted only if the certified upper endpoint is nonpositive, otherwise the current model is retained.

arXiv cs.LG · 7d ago
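The accept-or-retain rule described above can be sketched in a few lines. The band construction here is only a stand-in: it uses a normal-approximation confidence interval on paired per-example loss differences from a held-out sample, which is an assumption for illustration and not the paper's certificate. The function names and the `z=1.96` default are likewise hypothetical.

```python
import math
import numpy as np

def risk_increment_band(loss_new, loss_old, z=1.96):
    """Two-sided band for the mean risk increment (new minus old).

    Stand-in construction: normal-approximation interval on paired
    per-example loss differences. Requires at least two examples.
    """
    diff = np.asarray(loss_new, dtype=float) - np.asarray(loss_old, dtype=float)
    if diff.size < 2:
        raise ValueError("need at least two paired losses")
    mean = diff.mean()
    half_width = z * diff.std(ddof=1) / math.sqrt(diff.size)
    return float(mean - half_width), float(mean + half_width)

def accept_update(loss_new, loss_old, z=1.96):
    """Accept the candidate only if the band's upper endpoint is nonpositive."""
    _, upper = risk_increment_band(loss_new, loss_old, z)
    return upper <= 0.0

def risk_controlled_step(current_model, candidate_model, loss_new, loss_old):
    """Return the candidate if certified, otherwise retain the current model."""
    return candidate_model if accept_update(loss_new, loss_old) else current_model
```

Note that an update whose average loss is lower can still be rejected when the differences are noisy, because the decision depends on the upper endpoint and not on the mean.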

OpenAnt: LLM-Powered Vulnerability Discovery System

OpenAnt uses code decomposition, adversarial verification, and dynamic testing to identify vulnerabilities in large codebases. It reduces analysis surface by up to 97% and cuts false positives while validating findings through automated, sandboxed execution. Evaluated on OpenSSL, WordPress, and Flowise, it discovers previously unknown vulnerabilities with manageable cost and scalability.

arXiv cs.CL · 7d ago

Steerable Cultural Preference Optimization of Reward Models

This paper introduces SCPO, a reward model training algorithm that balances diverse cultural preferences across subcommunities. SCPO improves minority reward model performance by up to 7 points across two datasets and seven countries, while being up to 280% more data-efficient in training than full-data fine-tuning. Analysis shows reduced bias through targeted subcommunity preference evaluation.

arXiv cs.CL · 7d ago

Misfired Alignment in LLMs: A Quantitative Study

A new study introduces VETO, a benchmark of 2,032 BBQ-derived contrastive pairs, to quantify misfired alignment in large language models. It defines the Misfired Alignment Rate (MAR) and finds that all benchmarked LLMs exhibit MARs between 4.7% and 18.9%, while human participants achieve 0%. The research shows alignment cues can amplify these failures, with evidence suppression occurring in late layers of models and emerging after instruction training.

arXiv cs.CL · 7d ago

LLMs Struggle to Capture Item Discrimination in Reading Assessments

A study finds that large language models fail to reliably measure item discrimination in reading comprehension assessments. Some models show weak alignment with human-calibrated scores, ranging from 0.152 to 0.241, but current LLMs do not adequately capture how assessment items distinguish students of different proficiency levels.

arXiv cs.CL · 7d ago

Output Vector Editing Reduces Memorization in LLMs

A new method called output vector editing minimally modifies MLP neurons' output vectors to suppress memorized sequences in large language models, achieving up to 87.9% suppression in OLMo-7B. This approach outperforms zeroing neuron activations by a factor of 2.7 and works across four models from 36-7B parameters, with success rates scaling with model size and showing consistent performance across architectures.

arXiv cs.CL · 7d ago

RedactionBench: A Benchmark for Contextual Privacy in AI

RedactionBench introduces a manually annotated benchmark of 200 diverse documents across 11 domains to evaluate privacy-preserving redaction. It features R-Score, a character-level metric that treats semantically similar redactions equally and reduces bias from formatting choices. Human evaluations reveal significant disagreement on contextual redactions (47.7% consensus), highlighting the subjective nature of privacy and motivating the need for standardized, context-aware benchmarks.