Safety & alignment — korshunov.ai

Wasserstein Policy Learning for Distributional Outcomes

This paper introduces offline policy learning for distribution-valued outcomes, where rewards are derived from utility functionals applied to Wasserstein barycenters. It establishes statistical guarantees using inverse propensity weighting (IPW) and doubly robust (DR) estimators, proves finite-sample regret bounds with leading dependence $\widetilde{\mathcal{O}}(\sqrt{\mathrm{N\text{-}dim}(\Pi)/N})$, and provides a minimax lower bound confirming that this rate is sharp.
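For readers unfamiliar with the two estimators named above, here is a minimal sketch of IPW and DR value estimation in the textbook scalar-reward setting. It is not the paper's method: it assumes each distribution-valued outcome has already been reduced to a single utility number, and the names `ipw_value`, `dr_value`, and `mu_hat` are hypothetical.

```python
# Sketch only: textbook IPW and DR off-policy value estimates for scalar utilities.
# Assumption: the distributional outcome was already mapped to one number per sample.

def ipw_value(actions, utilities, propensities, policy_actions):
    """Average of utility / propensity over samples where the logged action
    matches the action the target policy would have taken."""
    total = 0.0
    for a, u, p, pa in zip(actions, utilities, propensities, policy_actions):
        if a == pa:
            total += u / p
    return total / len(actions)


def dr_value(actions, utilities, propensities, policy_actions, mu_hat):
    """Doubly robust estimate: model prediction for the policy's action plus an
    importance-weighted correction on samples where the logged action matches.
    mu_hat[i][a] is a (hypothetical) model's predicted utility of action a."""
    total = 0.0
    for i, (a, u, p, pa) in enumerate(
        zip(actions, utilities, propensities, policy_actions)
    ):
        total += mu_hat[i][pa]
        if a == pa:
            total += (u - mu_hat[i][a]) / p
    return total / len(actions)


# Toy logged data: two actions, uniform logging policy, target policy always picks 0.
actions = [0, 1, 0, 1]
utilities = [1.0, 2.0, 3.0, 4.0]
propensities = [0.5, 0.5, 0.5, 0.5]
policy_actions = [0, 0, 0, 0]
mu_hat = [[1.0, 1.0] for _ in actions]  # deliberately crude outcome model

ipw_estimate = ipw_value(actions, utilities, propensities, policy_actions)
dr_estimate = dr_value(actions, utilities, propensities, policy_actions, mu_hat)
```

On this toy data the two estimates coincide; in general DR stays consistent if either the propensities or the outcome model is correct.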

arXiv cs.LG · 7d ago

XAI reveals key drivers in European electricity markets

A study using SHAP and SSHAP techniques analyzes electricity price drivers across 39 European bidding zones. It finds solar energy has a disproportionate impact on prices, gas remains a dominant factor, and interconnections highlight regional interdependence. The research also builds a synthetic EU-wide market to examine a fully integrated, single-price scenario.

arXiv cs.LG · 7d ago

Local Population-Risk Certificates for Model Updates

The paper introduces local certificates that provide two-sided confidence bands for population-risk increments around a current model. The upper endpoint of this band defines a risk-controlled update rule: updates are accepted only if the certified upper endpoint is nonpositive, otherwise the current model is retained.
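The accept-or-retain rule described above can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: the `Certificate` class and `risk_controlled_update` function are hypothetical names, and the sketch takes the confidence band as given rather than computing it.

```python
from dataclasses import dataclass


@dataclass
class Certificate:
    """Two-sided band on the population-risk increment of a candidate update
    (candidate risk minus current risk). How the band is computed is out of scope."""
    lower: float
    upper: float


def risk_controlled_update(current_model, candidate_model, cert):
    """Accept the candidate only if the certified upper endpoint is nonpositive;
    otherwise keep the current model."""
    if cert.upper <= 0.0:
        return candidate_model
    return current_model


# The band lies entirely at or below zero: the update is accepted.
accepted = risk_controlled_update("current", "candidate", Certificate(-0.10, -0.01))
# The band straddles zero: the current model is retained.
retained = risk_controlled_update("current", "candidate", Certificate(-0.05, 0.02))
```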

arXiv cs.LG · 7d ago

OpenAnt: LLM-Powered Vulnerability Discovery System

OpenAnt uses code decomposition, adversarial verification, and dynamic testing to identify vulnerabilities in large codebases. It reduces the analysis surface by up to 97% and cuts false positives while validating findings through automated, sandboxed execution. Evaluated on OpenSSL, WordPress, and Flowise, it discovers previously unknown vulnerabilities while keeping cost manageable and remaining scalable.

arXiv cs.CL · 7d ago

Steerable Cultural Preference Optimization of Reward Models

This paper introduces SCPO, a novel reward model training algorithm that balances diverse cultural preferences across subcommunities. SCPO improves minority reward model performance by up to 7 points across two datasets and seven countries, while being up to 280% more data-efficient in training than full-data fine-tuning. Analysis shows reduced bias through targeted subcommunity preference evaluation.

arXiv cs.CL · 7d ago

Misfired Alignment in LLMs: A Quantitative Study

A new study introduces VETO, a benchmark of 2,032 BBQ-derived contrastive pairs, to quantify misfired alignment in large language models. It defines the Misfired Alignment Rate (MAR) and finds that all benchmarked LLMs exhibit MARs between 4.7% and 18.9%, while human participants achieve 0%. The research shows alignment cues can amplify these failures, with evidence suppression occurring in late layers of models and emerging after instruction training.

arXiv cs.CL · 7d ago

LLMs Struggle to Capture Item Discrimination in Reading Assessments

A study finds that large language models fail to reliably measure item discrimination in reading comprehension assessments. While some models show weak alignment with human-calibrated scores—ranging from 0.152 to 0.241—current LLMs do not adequately capture how assessment items distinguish students of different proficiency levels.

arXiv cs.CL · 7d ago

Output Vector Editing Reduces Memorization in LLMs

A new method called output vector editing minimally modifies MLP neurons' output vectors to suppress memorized sequences in large language models, achieving up to 87.9% suppression in OLMo-7B. This approach outperforms zeroing neuron activations by a factor of 2.7 and works across four models from 36-7B parameters, with success rates scaling with model size and showing consistent performance across architectures.

arXiv cs.CL · 7d ago

RedactionBench: A Benchmark for Contextual Privacy in AI

RedactionBench introduces a manually annotated benchmark of 200 diverse documents across 11 domains to evaluate privacy-preserving redaction. It features R-Score, a character-level metric that treats semantically similar redactions equally and reduces bias from formatting choices. Human evaluations reveal significant disagreement on contextual redactions (47.7% consensus), highlighting the subjective nature of privacy and motivating the need for standardized, context-aware benchmarks.

arXiv cs.CL · 7d ago

LLM-based Metrics Improve Clinical Significance Evaluation in Radiology

A study introduces lightweight, interpretable metrics that sharpen the boundary between clinically significant errors and harmless variations in radiology reports. These metrics outperform large medical LLMs and rival proprietary models, with one-pass training proven effective for cost-sensitive deployment. The two-pass setting fails to consistently improve performance and shifts focus from error detection to robustness.

arXiv cs.CL · 7d ago

ImpSH Improves Implicit Hate Speech Detection Across Domains

ImpSH, a triplet-based framework, aligns posts with implied statements and uses context-bounded semi-hard negatives to enhance detection of implicit hate speech. Evaluations on IHC, SBIC, and DynaHate show ImpSH outperforms standard supervised contrastive methods in cross-domain settings, with improved representation stability and reduced false negatives under domain shifts.
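As background for the triplet setup mentioned above, here is a minimal sketch of the standard semi-hard negative criterion and triplet margin loss. It assumes plain Euclidean distance on toy 2-D embeddings and does not reproduce ImpSH's context-bounded variant, whose details are not given here. All function names are hypothetical.

```python
import math


def dist(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def semi_hard_negatives(anchor, positive, candidates, margin):
    """Keep negatives that are farther from the anchor than the positive,
    but still inside the margin: d(a,p) < d(a,n) < d(a,p) + margin."""
    d_ap = dist(anchor, positive)
    return [n for n in candidates if d_ap < dist(anchor, n) < d_ap + margin]


def triplet_loss(anchor, positive, negative, margin):
    """Standard triplet margin loss: max(0, d(a,p) - d(a,n) + margin)."""
    return max(0.0, dist(anchor, positive) - dist(anchor, negative) + margin)


# Toy embeddings: the anchor stands for a post, the positive for its implied statement.
anchor = (0.0, 0.0)
positive = (1.0, 0.0)
candidates = [(0.5, 0.0), (1.2, 0.0), (3.0, 0.0)]  # hard, semi-hard, easy

selected = semi_hard_negatives(anchor, positive, candidates, margin=0.5)
losses = [triplet_loss(anchor, positive, n, margin=0.5) for n in selected]
```

Semi-hard negatives give a nonzero loss without being closer to the anchor than the positive, which is the usual motivation for mining them.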

arXiv cs.CL · 7d ago

Rubric-Guided Counterfactual Recommendations for Medical Communication

A new pipeline uses language models to recommend minimal, interpretable changes to patient-doctor communication features like tone and personalization. These changes increase predicted positive feedback by an average of 6.41% and are non-negative for 93.31% of cases, without altering medical content.

arXiv cs.CL · 7d ago

Speech-Based Dementia Assessment with Error Mitigation

This study improves accuracy in dementia screening by using speech-derived features from the German Syndrom-Kurz-Test. Models combine transcript scores and Whisper embeddings to reduce scoring errors and approximate expert ratings by compensating for missing motor subtests. The approach achieves strong correlation with expert ratings and effectively distinguishes cognitive status groups.

arXiv cs.CL · 7d ago

Index Sickness Elimination via Baseline-Log Physical Separation

In a 391-session AI collaboration project, LLMs exhibited 'Index Sickness'—a failure where symbolic complexity leads to self-referential outputs disconnected from reality. The 'Pang Principle' asserts natural language conveys superior semantic quality over symbolic systems, and the 'Baseline-Log Physical Separation' mechanism reduced AI instruction volume by 75% and eliminated recurrence of Index Sickness in subsequent sessions.

arXiv cs.CL · 7d ago

Human-AI Coevolution Framework Reveals Social Intelligence Emergence

The Human-AI Coevolution Dynamics Framework (HACD-H) introduces a unified model for long-term human-AI interaction, integrating emotional adaptation, memory, and personality into a self-organizing social cognitive system. Results show social intelligence emerges through coevolution, with a significant negative correlation between social intelligence and social cognitive energy (r = -0.391, p < 0.001), and progressive energy reduction over time in interaction trajectories.

arXiv cs.AI · 7d ago

TRUST: Target-Confidence Recourse with tSeTlin Machines

TRUST enables users to specify desired prediction confidence when generating counterfactual explanations. By directly optimizing for confidence targets using a Probabilistic Tsetlin Machine and Bayesian optimization, TRUST produces more robust and interpretable recourse than traditional boundary-based methods, achieving perfect robustness with low cost and high confidence on real-world datasets.

arXiv cs.AI · 7d ago

ImpSH Improves Implicit Hate Speech Detection Across Domains

ImpSH, a triplet-based framework, aligns posts with implied statements and uses context-bounded semi-hard negatives to enhance detection of implicit hate speech. Evaluated on IHC, SBIC, and DynaHate with BERT and HateBERT, ImpSH outperforms standard supervised contrastive methods in cross-domain settings, showing improved generalizability and stability.

arXiv cs.AI · 7d ago

Scaling AEB with Massive Unlabeled Data via Meta-Feedback SSL

A meta-feedback semi-supervised learning framework enables scaling of automatic emergency braking using massive unlabeled fleet data. The stabilized approach reduces pseudo-label errors through noise-aware decoupling and kinematics-gated pseudo-labeling, improving safety with a 100:1 positive-to-false activation ratio and 35% more accident-free driving mileage compared to rule-based systems.

arXiv cs.AI · 7d ago

SciRisk-Bench: A Risk-Dimension-Aware Benchmark for AI4Science Safety

SciRisk-Bench introduces a benchmark to evaluate AI4Science safety by assessing models across 7 disciplines, 31 subdisciplines, and 10 risk dimensions. It evaluates both mainstream and science-oriented LLMs to identify specific gaps in risk recognition and avoidance within high-stakes scientific contexts.

arXiv cs.AI · 7d ago

TRAP: Benchmark for Task-completion and Resistance to Active Privacy-extraction

TRAP evaluates how well models complete tasks using private data without leaking it. Across 22 models, all show non-trivial privacy leakage, with instruction-following ability linked to higher leakage. Structural private field isolation prevents leakage by replacing private fields with hash keys, maintaining task accuracy without sacrificing privacy.
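One way to picture the field-isolation idea is the sketch below: private values are swapped for opaque keys before the record reaches the model, and the keys are resolved only after the model has responded. This is an assumption-laden illustration, not TRAP's implementation. The function names, the `priv_` key format, and the use of a keyed HMAC are all hypothetical choices.

```python
import hashlib
import hmac


def isolate_private_fields(record, private_fields, secret):
    """Replace private values with opaque keys.
    Returns the sanitized record (safe to show the model) and a vault
    mapping each key back to its original value (kept outside the model)."""
    sanitized, vault = {}, {}
    for name, value in record.items():
        if name in private_fields:
            # Keyed hash so low-entropy values cannot be guessed from the key alone.
            digest = hmac.new(secret, f"{name}:{value}".encode(), hashlib.sha256)
            key = "priv_" + digest.hexdigest()[:12]
            vault[key] = value
            sanitized[name] = key
        else:
            sanitized[name] = value
    return sanitized, vault


def resolve(text, vault):
    """Substitute real values back into model output, outside the model's view."""
    for key, value in vault.items():
        text = text.replace(key, str(value))
    return text


record = {"name": "Ada", "ssn": "123-45-6789", "task": "fill in the form"}
sanitized, vault = isolate_private_fields(record, {"ssn"}, secret=b"demo-secret")

# Stand-in for a model response that only ever saw the opaque key.
model_output = f"Form completed with SSN {sanitized['ssn']}."
final_output = resolve(model_output, vault)
```

Because the model never sees the raw value, no prompt can extract it from the model's context; the task still completes because the key is resolved downstream.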