Safety & alignment
arxiv arXiv cs.CL · 2d ago

Cognitive Digital Twins: Ethical Risks and Governance

Cognitive digital twins (CDTs) are dynamic computational models of individual cognition, updated from personal data to simulate or act on behalf of users. This paper introduces a 5A governance framework—authority, autonomy, access and control, accountability, and availability—to address ethical risks like misrepresentation, proxy-power asymmetries, and shadow twins, emphasizing the need for governance over cognitive representation itself, not just decision-making or data use.

lab Cohere Blog · 2d ago

AI's Cultural Gaps Expose Global Users to Misrepresentation and Marginalization

A global survey of 81 AI users from 22 countries found that 89.5% of non-English speakers switch to English when using AI because they perceive it as more accurate. Over one-third reported that AI fails to understand their cultures, and 63% had experienced violations of cultural norms, including Western-centric narratives and inappropriate formality. Participants also worried that AI will further marginalize their cultures: 67% agreed that AI will reduce cultural diversity to stereotypes in the future.

arxiv arXiv cs.CL · 2d ago

Uncertainty-Based Decontamination for LLMs

The paper proposes Uncertainty-Based Decontamination (UBD), a method that uses deep ensembles to estimate per-sample memorization in contaminated models without needing an uncontaminated reference model. UBD constructs a debiased target distribution from ensemble uncertainty and uses it to correct output distributions, aligning significantly better with uncontaminated models than baselines do while maintaining performance on clean data.

arxiv arXiv cs.CL · 2d ago

TF-RefusalBench Measures Over-Alignment in LLMs for Criminal Law

TF-RefusalBench is a multilingual benchmark derived from Swiss Supreme Court rulings, containing 5,200 prompts in French, German, Italian, and English. It shows that over-alignment in LLMs varies with both the model and the prompt language, and that refusals affect task faithfulness in ways that simple refusal rates do not capture. Abliteration of refusal directions reduces over-alignment with minimal performance loss on criminal law tasks.
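
For readers unfamiliar with the term, abliteration is commonly described as removing the component of a model's hidden activations that lies along an estimated "refusal direction". The snippet below is only a minimal sketch of that projection step on a toy vector, not the benchmark authors' procedure: the function name `ablate_direction`, the example vectors, and the assumption that a single direction is already known are all hypothetical.

```python
import math


def ablate_direction(hidden, direction):
    """Return `hidden` with its component along `direction` removed."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    unit = [d / norm for d in direction]
    # Scalar projection of the hidden state onto the unit direction.
    coeff = sum(h * u for h, u in zip(hidden, unit))
    # Subtract the projected component; the result is orthogonal to `direction`.
    return [h - coeff * u for h, u in zip(hidden, unit)]


# Toy values for illustration only (not taken from any model).
hidden_state = [2.0, 1.0, -1.0]
refusal_direction = [1.0, 0.0, 0.0]
ablated = ablate_direction(hidden_state, refusal_direction)
```

In practice the direction would be estimated from model activations and the projection applied inside the network; both steps are outside the scope of this sketch.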

media r/LocalLLaMA · 2d ago

EU AI Act mandates AI-generated text watermarking from August 2026

According to the post, the EU AI Act requires all AI systems that generate synthetic text to mark it with machine-readable, detectable watermarks, using a robust, interoperable, two-layer technical solution. This applies to all AI models, including open-source ones, and extends to any service accessible to users in the EU, regardless of where the provider is located. Non-compliance risks fines of up to 35 million euros or a percentage of annual turnover, with providers of 'systemic risk' AI models facing heightened liability.

arxiv arXiv cs.CL · 2d ago

Validation-Gated Mechanistic Analysis of Suicidality Detection in LLMs

A validation-gated framework analyzes an LLM's internal features only after the corresponding behavior has been observed, and it identifies a mid-network feature that causally contributes to suicidality detection. The feature is semantic, low-rank, shared across models, and specific to suicidality rather than general distress, though steering experiments indicate it is necessary but not sufficient. Smaller models encode suicidality but only larger ones act on it, and the evidence is limited to English Reddit text.

arxiv arXiv cs.CL · 2d ago

MedLayXPlain: Benchmarking Expert-Lay Gap in Medical Vision-Language Models

MedLayXPlain introduces the first large-scale benchmark for medical lay language generation, featuring 122,789 region-grounded samples across eight imaging modalities. It evaluates medical vision-language models on expert-lay alignment using a hierarchical ontology system and a lightweight evaluator, revealing a systematic gap: expert-level performance in captioning coexists with significant degradation in lay language, while general-purpose models lack clinical precision.