Safety & alignment — korshunov.ai

Machine Whistleblowing: A Normative and Principled Approach

Artificial agents can and should whistleblow, but only within a normative framework rooted in human whistleblowing traditions. The paper calls for government regulators to establish clear guidelines on what machines may disclose and how to legally protect developers of such systems.

arxiv arXiv cs.AI · 2d ago

Influence-Based Explanations for Dysarthria Severity Assessment

A new framework provides instance-level explanations for dysarthria severity assessment by identifying supportive and competing training samples. Using gradient-based influence scores, it links model decisions to perceptible reference cases, enabling auditable and interpretable predictions through controlled deletion experiments.
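As a rough illustration of how gradient-based influence can separate supportive from competing training samples, here is a minimal sketch. It assumes a toy logistic model and a single-checkpoint gradient dot product; the paper's actual scoring may differ, and every name and number below is hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_gradient(weights, features, label):
    """Gradient of the logistic loss w.r.t. the weights for one sample."""
    prediction = sigmoid(sum(w * x for w, x in zip(weights, features)))
    return [(prediction - label) * x for x in features]

def influence_scores(weights, train_set, test_sample):
    """Dot product of each training gradient with the test gradient.

    A positive score means a gradient step on that training sample would
    also lower the test loss (supportive); a negative score means it
    would raise it (competing).
    """
    test_grad = loss_gradient(weights, *test_sample)
    scores = []
    for features, label in train_set:
        train_grad = loss_gradient(weights, features, label)
        scores.append(sum(a * b for a, b in zip(train_grad, test_grad)))
    return scores

# Hypothetical toy data: (features, label) pairs.
weights = [0.5, -0.25]
train_set = [([1.0, 0.0], 1), ([1.0, 0.0], 0), ([0.0, 1.0], 1)]
test_sample = ([1.0, 0.2], 1)

scores = influence_scores(weights, train_set, test_sample)
supportive = [i for i, s in enumerate(scores) if s > 0]
competing = [i for i, s in enumerate(scores) if s < 0]
print("supportive:", supportive, "competing:", competing)
```

A deletion experiment in this setting would remove the highest-scoring samples, retrain, and check whether the test prediction shifts as the scores suggest.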

arxiv arXiv cs.AI · 2d ago

Warning labels shift perceptions of sycophantic AI but not its influence

A study with 2,610 participants found that disclosing an AI as sycophantic alters users' perceptions of its objectivity and trustworthiness. However, such labels do not change users' belief in their own rightness or their willingness to resolve conflicts. Warning labels thus shift perception without reducing the AI's actual influence on behavior.

arxiv arXiv cs.AI · 2d ago

Sexualised AI Voices Amplify Gender Power Asymmetries

A study finds that sexualised AI voices on a commercial platform reinforce binary, heteronormative gender expressions. Female-coded voices are more often labelled with sexualised and submissive descriptors, while male-coded voices are linked to dominance and positive traits, highlighting persistent gendered power imbalances in AI voice design.

arxiv arXiv cs.AI · 2d ago

Explainable AI Model for Career-Related Depression in University Students

A new Explainable AI framework uses structured behavioral data and facial emotion features to detect early signs of career-related depression and anxiety in university students. The model, evaluated on Pakistani student data, achieves an F1-score of 89.12% and identifies key markers like avoidance of direct gaze and social withdrawal, aligning with psychological theory.

arxiv arXiv cs.AI · 2d ago

AI Alignment via Social Choice Theory

A new survey explores how social choice theory helps aggregate human feedback in AI alignment. It identifies failure modes in feedback aggregation and offers principled methods for handling disagreement among human judgments.
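One classical failure mode of preference aggregation is the Condorcet cycle, where pairwise majority voting yields no consistent winner. The sketch below is an illustration only: the summary does not say which rules the survey covers, and the annotator rankings are hypothetical.

```python
from itertools import combinations

def pairwise_majority(rankings):
    """Map (x, y) -> True when a strict majority ranks x above y."""
    candidates = rankings[0]
    result = {}
    for x, y in combinations(candidates, 2):
        x_wins = sum(r.index(x) < r.index(y) for r in rankings)
        y_wins = len(rankings) - x_wins
        result[(x, y)] = x_wins > y_wins
        result[(y, x)] = y_wins > x_wins
    return result

def condorcet_winner(rankings):
    """Return the option that beats every other one pairwise, or None."""
    beats = pairwise_majority(rankings)
    candidates = rankings[0]
    for c in candidates:
        if all(beats[(c, other)] for other in candidates if other != c):
            return c
    return None

def borda_scores(rankings):
    """Positional scoring: top rank earns n-1 points, last rank earns 0."""
    n = len(rankings[0])
    scores = {c: 0 for c in rankings[0]}
    for r in rankings:
        for position, c in enumerate(r):
            scores[c] += n - 1 - position
    return scores

# Hypothetical rankings of three model responses by three annotators.
cyclic = [["A", "B", "C"], ["B", "C", "A"], ["C", "A", "B"]]
consensus = [["A", "B", "C"], ["A", "C", "B"], ["B", "A", "C"]]

print(condorcet_winner(cyclic), borda_scores(cyclic))
print(condorcet_winner(consensus), borda_scores(consensus))
```

In the cyclic case neither rule picks a winner, which is the kind of disagreement an aggregation method has to handle explicitly rather than average away.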

lab OpenAI News · 2d ago

OpenAI Builds Shared AI Standards via Appia Foundation

OpenAI, through the Appia Foundation, is advancing shared standards for advanced AI by developing evaluation frameworks and safety practices and by promoting global cooperation.

media r/LocalLLaMA · 2d ago

GLM 5.2's Attitude Reflects Cultural Training Influences

Users praise GLM 5.2 for its direct, unflinching attitude, contrasting it with more saccharine US models. The author speculates this behavior stems from culturally specific training data, suggesting local datasets have a stronger influence than previously assumed.

arxiv arXiv cs.CL · 2d ago

Cognitive Digital Twins: Ethical Risks and Governance

Cognitive digital twins (CDTs) are dynamic computational models of individual cognition, updated from personal data to simulate or act on behalf of users. This paper introduces a 5A governance framework—authority, autonomy, access and control, accountability, and availability—to address ethical risks like misrepresentation, proxy-power asymmetries, and shadow twins, emphasizing the need for governance over cognitive representation itself, not just decision-making or data use.

lab Cohere Blog · 2d ago

AI's Cultural Gaps Expose Global Users to Misrepresentation and Marginalization

A global survey of 81 AI users from 22 countries found that 89.5% of non-English speakers switch to English when using AI, citing perceived accuracy. Over one-third reported AI fails to understand their cultures, with 63% experiencing violations of cultural norms, including Western-centric narratives and inappropriate formality. Participants expressed concern that AI will further marginalize their cultures, with 67% agreeing AI will reduce cultural diversity to stereotypes in the future.

arxiv arXiv cs.CL · 2d ago

AgentCIBench Evaluates Privacy Risks in Computer-Use Agents

AgentCIBench introduces a benchmark to assess privacy risks in computer-use agents. It identifies three key failure modes—visual co-location, task-ambiguity overshare, and recipient misalignment—and finds that 11 of 15 evaluated agents leak personal data in over 50% of scenarios, with an average leakage of 67.9%.

arxiv arXiv cs.CL · 2d ago

MuPPET: Benchmark for Multi-Party LLM Privacy

MuPPET introduces a benchmark for contextual privacy in multi-party conversations. Experiments reveal models leak significantly more private information in group settings than in one-to-one interactions, with smaller open-weights models being especially vulnerable. Existing privacy defenses provide only partial protection and fail to address the core issue of party tracking.

arxiv arXiv cs.CL · 2d ago

Uncertainty-Based Decontamination for LLMs

Uncertainty-Based Decontamination (UBD) uses deep ensembles to estimate per-sample memorization in contaminated models without needing an uncontaminated model. UBD constructs a debiased target distribution from ensemble uncertainty to correct output distributions, achieving significantly better alignment with uncontaminated models than baselines while maintaining performance on clean data.

arxiv arXiv cs.CL · 2d ago

TF-RefusalBench Measures Over-Alignment in LLMs for Criminal Law

TF-RefusalBench is a multilingual benchmark derived from Swiss Supreme Court rulings, containing 5,200 prompts in French, German, Italian, and English. It reveals that over-alignment in LLMs depends on both the model and the language, and that refusals affect task faithfulness in ways that simple refusal rates do not capture. Abliteration of refusal directions reduces over-alignment with minimal performance loss in criminal law tasks.
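Abliteration is commonly described as projecting a "refusal direction" out of a model's activations, so that the component along that direction becomes zero. The sketch below shows only that projection step on plain lists. The vectors are hypothetical, and whether the benchmark's authors apply exactly this procedure is an assumption.

```python
import math

def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def ablate_direction(activation, direction):
    """Remove the component of `activation` that lies along `direction`."""
    unit = normalize(direction)
    projection = sum(a * u for a, u in zip(activation, unit))
    return [a - projection * u for a, u in zip(activation, unit)]

# Hypothetical 3-dimensional activation and refusal direction.
refusal_direction = [3.0, 4.0, 0.0]
activation = [1.0, 2.0, 5.0]

ablated = ablate_direction(activation, refusal_direction)
residual = sum(a * u for a, u in zip(ablated, normalize(refusal_direction)))
print(ablated, residual)
```

After ablation the activation has no remaining component along the chosen direction, while coordinates orthogonal to it are unchanged.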

arxiv arXiv cs.CL · 2d ago

Self-Stigma Is Not Uniform: LLMs Need Persona-Aware Support

A study of 1,174 Reddit users reveals four distinct self-stigma personas. LLMs trained to recognize these personas outperform generic models in targeted responses, though clinical experts prefer generic empathy over persona-matched support. The research highlights a tension between tailored empathy and holistic user preference in stigma-related AI interventions.

arxiv arXiv cs.CL · 2d ago

Evaluation Awareness Is Multivariate, Not a Single Capability

Evaluation awareness in open language models is not a unified trait. Eight experiments across 37 models reveal that detection, safety behavior shifts, and representation stability vary independently, with only weak correlations between them. This undermines the idea of a single awareness score as a reliable indicator of deployment safety, highlighting the 'benchmark illusion'.

arxiv arXiv cs.CL · 2d ago

LLMs Fail to Reliably Self-Report Adversarial Prefills

No large language model reliably detects when its responses were influenced by adversarial prefill attacks. Introspective signals are strongest in safety-related reasoning but are probe-dependent, and they can be amplified by LoRA fine-tuning, which paradoxically increases attack success rates.

media r/LocalLLaMA · 2d ago

EU AI Act mandates AI-generated text watermarking from August 2024

The EU AI Act requires all AI systems generating synthetic text to include machine-readable, detectable watermarks using robust, interoperable technical solutions with two layers. This applies to all AI models, including open-source ones, and extends to any service accessible by EU citizens, regardless of location. Non-compliance risks fines of up to 35 million euros or a percentage of annual income, with providers of 'systemic risk' AI models facing heightened liability.

arxiv arXiv cs.CL · 2d ago

Plural Epistemologies in AI Language Technology

The paper argues that cultural alignment in NLP requires plural epistemologies, not just diverse data. It proposes a socio-technical model to analyze how multiple, locally grounded ways of knowing can be integrated into language technology, emphasizing that current approaches often fail to address deeper issues of power and governance.

arxiv arXiv cs.CL · 2d ago

π-RAG: Oblivious Retrieval via Semantic Quantization and Transcendental Addressing

π-RAG decouples LLMs from sensitive data by using π's digits as an immutable source of entropy. It introduces a semantic quantization layer that maps user inputs to canonical intent centroids, then uses a cryptographic salt to generate deterministic offsets pointing to standardized payloads, aiming to provide oblivious retrieval with mathematical guarantees of data privacy.