OpenAI Builds Shared AI Standards via Appia Foundation
OpenAI, through the Appia Foundation, is advancing shared standards for advanced AI by developing evaluation frameworks and safety practices and by promoting global cooperation.
Users praise GLM 5.2 for its direct, unflinching attitude, contrasting it with more saccharine US models. The author speculates this behavior stems from culturally specific training data, suggesting local datasets have a stronger influence than previously assumed.
Cognitive digital twins (CDTs) are dynamic computational models of individual cognition, updated from personal data to simulate or act on behalf of users. This paper introduces a 5A governance framework—authority, autonomy, access and control, accountability, and availability—to address ethical risks like misrepresentation, proxy-power asymmetries, and shadow twins, emphasizing the need for governance over cognitive representation itself, not just decision-making or data use.
A global survey of 81 AI users from 22 countries found that 89.5% of non-English speakers switch to English when using AI, citing perceived accuracy. Over one-third reported AI fails to understand their cultures, with 63% experiencing violations of cultural norms, including Western-centric narratives and inappropriate formality. Participants expressed concern that AI will further marginalize their cultures, with 67% agreeing AI will reduce cultural diversity to stereotypes in the future.
AgentCIBench introduces a benchmark to assess privacy risks in computer-use agents. It identifies three key failure modes—visual co-location, task-ambiguity overshare, and recipient misalignment—and finds that 11 of 15 evaluated agents leak personal data in over 50% of scenarios, with an average leakage of 67.9%.
MuPPET introduces a benchmark for contextual privacy in multi-party conversations. Experiments reveal models leak significantly more private information in group settings than in one-to-one interactions, with smaller open-weights models being especially vulnerable. Existing privacy defenses provide only partial protection and fail to address the core issue of party tracking.
We propose Uncertainty-Based Decontamination (UBD), a method that uses deep ensembles to estimate per-sample memorization in contaminated models without needing an uncontaminated model. UBD constructs a debiased target distribution from ensemble uncertainty to correct output distributions, achieving significantly better alignment with uncontaminated models compared to baselines, while maintaining performance on clean data.
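The summary does not say how UBD measures ensemble uncertainty, so the following is only a minimal sketch of one common choice: the mutual information between the prediction and the ensemble member, computed as the entropy of the averaged distribution minus the average member entropy. The function names and the choice of measure are assumptions for illustration, and the sketch does not reproduce the paper's debiased target distribution.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def ensemble_disagreement(member_probs):
    """Disagreement among ensemble members for a single sample.

    member_probs: one probability distribution per ensemble member.
    Returns entropy(mean distribution) - mean(member entropies), which is
    zero when all members agree and grows as they diverge.
    NOTE: this measure is an assumption, not the paper's stated method.
    """
    n = len(member_probs)
    k = len(member_probs[0])
    mean = [sum(m[i] for m in member_probs) / n for i in range(k)]
    total = entropy(mean)
    expected = sum(entropy(m) for m in member_probs) / n
    return total - expected

agree = [[0.9, 0.1], [0.9, 0.1]]
disagree = [[0.9, 0.1], [0.1, 0.9]]
print(ensemble_disagreement(agree), ensemble_disagreement(disagree))
```

Under this assumed measure, a sample on which every member gives the same confident answer scores zero, while a sample on which members contradict each other scores high.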
TF-RefusalBench is a multilingual benchmark derived from Swiss Supreme Court rulings, containing 5,200 prompts in French, German, Italian, and English. It reveals that over-alignment in LLMs is influenced by model and language factors, and that refusals impact task faithfulness beyond simple refusal rates. Abliteration of refusal directives reduces over-alignment with minimal performance loss in criminal law tasks.
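Abliteration usually refers to removing a single "refusal direction" from a model's hidden activations by orthogonal projection. The summary gives no implementation details, so this is a minimal sketch of that projection on plain Python lists; the function name and the toy vectors are hypothetical, and finding the direction in a real model is outside its scope.

```python
import math

def ablate_direction(hidden, direction):
    """Remove the component of `hidden` that lies along `direction`.

    Computes h - (h . r_hat) * r_hat, where r_hat is the unit vector of
    `direction`. Both arguments are equal-length lists of floats.
    """
    norm = math.sqrt(sum(x * x for x in direction))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    unit = [x / norm for x in direction]
    projection = sum(h * u for h, u in zip(hidden, unit))
    return [h - projection * u for h, u in zip(hidden, unit)]

# Toy example: remove the first axis from a 2-D activation.
print(ablate_direction([3.0, 4.0], [2.0, 0.0]))  # [0.0, 4.0]
```

After the projection the activation has no component along the ablated direction, and the components orthogonal to it are unchanged.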
A study of 1,174 Reddit users reveals four distinct self-stigma personas. LLMs trained to recognize these personas outperform generic models in targeted responses, though clinical experts prefer generic empathy over persona-matched support. The research highlights a tension between tailored empathy and holistic user preference in stigma-related AI interventions.
Evaluation awareness in open language models is not a unified trait. Eight experiments across 37 models show that detection, safety behavior shifts, and representation stability vary independently, with only weak correlations between them. This undermines the idea that a single awareness score is a reliable indicator of deployment safety, a problem the authors call the 'benchmark illusion'.
No large language model reliably detects when its responses have been influenced by adversarial prefill attacks. Introspective signals are strongest in safety-related reasoning, but they are probe-dependent and can be amplified by LoRA fine-tuning, which paradoxically increases attack success rates.
The EU AI Act requires all AI systems generating synthetic text to include machine-readable, detectable watermarks using robust, interoperable technical solutions with two layers. This applies to all AI models, including open-source ones, and extends to any service accessible by EU citizens, regardless of location. Non-compliance risks fines of up to 35 million euros or a percentage of annual income, with providers of 'systemic risk' AI models facing heightened liability.
The paper argues that cultural alignment in NLP requires plural epistemologies, not just diverse data. It proposes a socio-technical model to analyze how multiple, locally grounded ways of knowing can be integrated into language technology, emphasizing that current approaches often fail to address deeper issues of power and governance.
π-RAG decouples LLMs from sensitive data by using π's digits as an immutable, uneditable source of entropy. It introduces a semantic quantization layer that maps user inputs to canonical intent centroids, then uses cryptographic salt to generate deterministic offsets pointing to standardized payloads, ensuring oblivious retrieval and mathematical guarantees of data privacy.
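The step from a canonical intent to a deterministic offset can be sketched as a keyed hash reduced modulo the size of the digit table. This is an illustration under stated assumptions, not the paper's construction: the use of HMAC-SHA256, the names `offset_for` and `digits_at`, the 50-digit table, and the window size are all hypothetical, and the sketch makes no claim about the privacy guarantees.

```python
import hashlib
import hmac

# First 50 decimal digits of pi after the decimal point (toy-sized table).
PI_DIGITS = "14159265358979323846264338327950288419716939937510"

def offset_for(intent_id: str, salt: bytes, window: int = 8) -> int:
    """Map a canonical intent label to a deterministic offset into PI_DIGITS.

    Assumed scheme: HMAC-SHA256 keyed by the salt, reduced modulo the number
    of valid window start positions.
    """
    digest = hmac.new(salt, intent_id.encode("utf-8"), hashlib.sha256).digest()
    return int.from_bytes(digest, "big") % (len(PI_DIGITS) - window + 1)

def digits_at(intent_id: str, salt: bytes, window: int = 8) -> str:
    """Return the `window` digits of pi that start at the derived offset."""
    start = offset_for(intent_id, salt, window)
    return PI_DIGITS[start:start + window]

print(digits_at("reset_password", b"example-salt"))
```

The same intent label and salt always yield the same offset, while a different salt generally moves the lookup elsewhere in the table.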
A user reports their Hugging Face account, AntixStudioDesign, was locked unexpectedly during experimentation with AI tools. They have contacted the Safety Team via email and seek advice on account recovery, response time, and data preservation options.
OTTER is a black-box red-teaming framework that bypasses toxicity filters by modifying as few as five tokens. Evaluated on 457 AdvBench prompts across four GPT models, it increases jailbreak success rate from 7.0% to 84.0%, offering the first quantitative analysis of toxicity-bypass relationships and actionable recommendations for classifier hardening.
A validation-gated framework evaluates LLM internal features only after observed behavior, revealing a mid-network feature that causally contributes to suicide detection. This feature is semantic, low-rank, cross-model, and specific to suicidality over general distress, though steering is necessary but not sufficient. The pattern shows smaller models encode suicidality but only larger ones act on it, with evidence limited to English Reddit text.
A new study reveals over 1,000 legal filings contain fabricated citations, with the number rising annually. Benchmarking five AI models shows improved performance, with GPT-5 achieving 82.8% recall and 60.5% F1 in agentic settings, though all models struggle with subtle errors and face resource constraints due to limited information access.
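The reported recall and F1 together imply a precision figure, since F1 is the harmonic mean of precision and recall. The sketch below solves F1 = 2PR / (P + R) for P. It assumes both numbers come from the same evaluation setting and that F1 is not macro-averaged across classes, neither of which the summary confirms.

```python
def precision_from(recall: float, f1: float) -> float:
    """Solve F1 = 2PR / (P + R) for precision P, given recall R and F1."""
    denominator = 2 * recall - f1
    if denominator <= 0:
        raise ValueError("F1 cannot be at least twice the recall")
    return f1 * recall / denominator

def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

p = precision_from(recall=0.828, f1=0.605)
print(round(p, 3))
```

Under those assumptions the implied precision is a little under 48%, so roughly half of the citations flagged in that setting would be false alarms.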
MedLayXPlain introduces the first large-scale benchmark for medical lay language generation, featuring 122,789 region-grounded samples across eight imaging modalities. It evaluates medical vision-language models on expert-lay alignment using a hierarchical ontology system and a lightweight evaluator, revealing a systematic gap: expert-level performance in captioning coexists with significant degradation in lay language, while general-purpose models lack clinical precision.
LISE decomposes speaker embeddings into interpretable components without annotations. Listening experiments show human participants correctly distinguish speakers with 83.9% accuracy, validating the interpretability of the components while preserving ASV performance.