Safety & alignment — korshunov.ai — ML news

lab OpenAI News · 1d ago

OpenAI Builds Shared AI Standards via Appia Foundation

OpenAI, through the Appia Foundation, is advancing shared standards for advanced AI by developing evaluation frameworks and safety practices and by promoting global cooperation.

media r/LocalLLaMA · 2d ago

GLM 5.2's Attitude Reflects Cultural Training Influences

Users praise GLM 5.2 for its direct, unflinching attitude, contrasting it with more saccharine US models. The author speculates this behavior stems from culturally specific training data, suggesting local datasets have a stronger influence than previously assumed.

arxiv arXiv cs.CL · 2d ago

Cognitive Digital Twins: Ethical Risks and Governance

Cognitive digital twins (CDTs) are dynamic computational models of individual cognition, updated from personal data to simulate or act on behalf of users. This paper introduces a 5A governance framework—authority, autonomy, access and control, accountability, and availability—to address ethical risks like misrepresentation, proxy-power asymmetries, and shadow twins, emphasizing the need for governance over cognitive representation itself, not just decision-making or data use.

lab Cohere Blog · 2d ago

AI's Cultural Gaps Expose Global Users to Misrepresentation and Marginalization

A global survey of 81 AI users from 22 countries found that 89.5% of non-English speakers switch to English when using AI, citing perceived accuracy. Over one-third reported AI fails to understand their cultures, with 63% experiencing violations of cultural norms, including Western-centric narratives and inappropriate formality. Participants expressed concern that AI will further marginalize their cultures, with 67% agreeing AI will reduce cultural diversity to stereotypes in the future.

arxiv arXiv cs.CL · 2d ago

AgentCIBench Evaluates Privacy Risks in Computer-Use Agents

AgentCIBench introduces a benchmark to assess privacy risks in computer-use agents. It identifies three key failure modes—visual co-location, task-ambiguity overshare, and recipient misalignment—and finds that 11 of 15 evaluated agents leak personal data in over 50% of scenarios, with an average leakage of 67.9%.

arxiv arXiv cs.CL · 2d ago

MuPPET: Benchmark for Multi-Party LLM Privacy

MuPPET introduces a benchmark for contextual privacy in multi-party conversations. Experiments reveal models leak significantly more private information in group settings than in one-to-one interactions, with smaller open-weights models being especially vulnerable. Existing privacy defenses provide only partial protection and fail to address the core issue of party tracking.

arxiv arXiv cs.CL · 2d ago

Uncertainty-Based Decontamination for LLMs

Uncertainty-Based Decontamination (UBD) uses deep ensembles to estimate per-sample memorization in contaminated models without needing an uncontaminated model. It constructs a debiased target distribution from ensemble uncertainty to correct output distributions, aligning significantly better with uncontaminated models than baselines do while maintaining performance on clean data.

arxiv arXiv cs.CL · 2d ago

TF-RefusalBench Measures Over-Alignment in LLMs for Criminal Law

TF-RefusalBench is a multilingual benchmark derived from Swiss Supreme Court rulings, containing 5,200 prompts in French, German, Italian, and English. It reveals that over-alignment in LLMs is influenced by model and language factors, and that refusals impact task faithfulness beyond simple refusal rates. Abliteration of refusal directions reduces over-alignment with minimal performance loss in criminal law tasks.

arxiv arXiv cs.CL · 2d ago

Self-Stigma Is Not Uniform: LLMs Need Persona-Aware Support

A study of 1,174 Reddit users reveals four distinct self-stigma personas. LLMs trained to recognize these personas outperform generic models in targeted responses, though clinical experts prefer generic empathy over persona-matched support. The research highlights a tension between tailored empathy and holistic user preference in stigma-related AI interventions.

arxiv arXiv cs.CL · 2d ago

Evaluation Awareness Is Multivariate, Not a Single Capability

Open language models show evaluation awareness is not a unified trait. Eight experiments across 37 models reveal detection, safety behavior shifts, and representation stability vary independently, with only weak correlations between them. This undermines the idea of a single awareness score as a reliable indicator of deployment safety, highlighting the 'benchmark illusion'.

arxiv arXiv cs.CL · 2d ago

LLMs Fail to Reliably Self-Report Adversarial Prefills

No large language model reliably detects when its responses have been influenced by adversarial prefill attacks. Introspective signals are strongest in safety-related reasoning, but they are probe-dependent and can be amplified by LoRA fine-tuning, which paradoxically increases attack success rates.

media r/LocalLLaMA · 2d ago

EU AI Act mandates AI-generated text watermarking from August 2024

The EU AI Act requires all AI systems generating synthetic text to include machine-readable, detectable watermarks using robust, interoperable technical solutions with two layers. This applies to all AI models, including open-source ones, and extends to any service accessible by EU citizens, regardless of location. Non-compliance risks fines of up to 35 million euros or a percentage of worldwide annual turnover, with providers of 'systemic risk' AI models facing heightened liability.

arxiv arXiv cs.CL · 2d ago

Plural Epistemologies in AI Language Technology

The paper argues that cultural alignment in NLP requires plural epistemologies, not just diverse data. It proposes a socio-technical model to analyze how multiple, locally grounded ways of knowing can be integrated into language technology, emphasizing that current approaches often fail to address deeper issues of power and governance.

arxiv arXiv cs.CL · 2d ago

π-RAG: Oblivious Retrieval via Semantic Quantization and Transcendental Addressing

π-RAG decouples LLMs from sensitive data by using π's digits as an immutable, uneditable source of entropy. It introduces a semantic quantization layer that maps user inputs to canonical intent centroids, then uses cryptographic salt to generate deterministic offsets pointing to standardized payloads, ensuring oblivious retrieval and mathematical guarantees of data privacy.
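
The two steps named in the summary, quantizing an input to a canonical centroid and deriving a deterministic offset from a salt, can be illustrated with a minimal sketch. This is not the paper's implementation: the nearest-centroid rule, the SHA-256 hash, and the names `quantize`, `salted_offset`, and `address_space` are all assumptions made for illustration, and the step that reads π's digits at the offset is omitted.

```python
import hashlib


def quantize(embedding, centroids):
    """Return the index of the centroid nearest to the embedding.

    Assumption: squared Euclidean distance stands in for whatever
    similarity measure the paper actually uses.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(centroids)), key=lambda i: sq_dist(embedding, centroids[i]))


def salted_offset(centroid_id, salt, address_space):
    """Map a centroid index and a secret salt to an offset in [0, address_space).

    Assumption: SHA-256 over salt + index is a hypothetical choice; the
    same inputs always yield the same offset.
    """
    digest = hashlib.sha256(salt + centroid_id.to_bytes(4, "big")).digest()
    return int.from_bytes(digest, "big") % address_space


# Toy 2-D "intent centroids" (hypothetical values).
centroids = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
cid = quantize((0.9, 1.2), centroids)
off = salted_offset(cid, b"example-salt", 10**6)
```

In this sketch, nearby inputs collapse to the same centroid index, so the offset depends only on the canonical intent and the salt, not on the user's exact wording.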

media Hugging Face Forums · 2d ago

My Hugging Face Account Was Locked

A user reports their Hugging Face account, AntixStudioDesign, was locked unexpectedly during experimentation with AI tools. They have contacted the Safety Team via email and seek advice on account recovery, response time, and data preservation options.

arxiv arXiv cs.CL · 2d ago

OTTER: Red-Teaming System for Toxicity-Evading Jailbreak Prompt Optimization

OTTER is a black-box red-teaming framework that bypasses toxicity filters by modifying as few as five tokens. Evaluated on 457 AdvBench prompts across four GPT models, it increases jailbreak success rate from 7.0% to 84.0%, offering the first quantitative analysis of toxicity-bypass relationships and actionable recommendations for classifier hardening.

arxiv arXiv cs.CL · 2d ago

Validation-Gated Mechanistic Analysis of Suicidality Detection in LLMs

A validation-gated framework evaluates LLM internal features only after observed behavior, revealing a mid-network feature that causally contributes to suicide detection. This feature is semantic, low-rank, cross-model, and specific to suicidality over general distress, though steering is necessary but not sufficient. The pattern shows smaller models encode suicidality but only larger ones act on it, with evidence limited to English Reddit text.

arxiv arXiv cs.CL · 2d ago

Study Finds AI Still Fails to Detect Legal Citation Hallucinations

A new study reveals over 1,000 legal filings contain fabricated citations, with the number rising annually. Benchmarking five AI models shows improved performance, with GPT-5 achieving 82.8% recall and 60.5% F1 in agentic settings, though all models struggle with subtle errors and face resource constraints due to limited information access.
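
The gap between the reported recall and F1 can be made concrete by solving the F1 definition for precision. A minimal sketch, assuming both figures come from the same agentic evaluation run (the summary does not state this explicitly); the function name is hypothetical.

```python
def precision_from_f1_and_recall(f1, recall):
    """Solve F1 = 2PR / (P + R) for P, giving P = F1 * R / (2R - F1)."""
    denom = 2 * recall - f1
    if denom <= 0:
        raise ValueError("F1 and recall are inconsistent: need F1 < 2 * recall")
    return f1 * recall / denom


# Figures quoted in the summary above: 82.8% recall, 60.5% F1.
p = precision_from_f1_and_recall(0.605, 0.828)
```

Under that assumption the implied precision is a little under 0.48, so roughly half of the citations the model flags as fabricated would in fact be genuine.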

arxiv arXiv cs.CL · 2d ago

MedLayXPlain: Benchmarking Expert-Lay Gap in Medical Vision-Language Models

MedLayXPlain introduces the first large-scale benchmark for medical lay language generation, featuring 122,789 region-grounded samples across eight imaging modalities. It evaluates medical vision-language models on expert-lay alignment using a hierarchical ontology system and a lightweight evaluator, revealing a systematic gap: expert-level performance in captioning coexists with significant degradation in lay language, while general-purpose models lack clinical precision.

arxiv arXiv cs.CL · 2d ago

Listenable Interpretable Speaker Embeddings

LISE decomposes speaker embeddings into interpretable components without annotations. Listening experiments show human participants correctly distinguish speakers with 83.9% accuracy, validating the interpretability of the components while preserving ASV performance.