Safety & alignment — korshunov.ai

OTTER: Red-Teaming System for Toxicity-Evading Jailbreak Prompt Optimization

OTTER is a black-box red-teaming framework that bypasses toxicity filters by modifying as few as five tokens. Evaluated on 457 AdvBench prompts across four GPT models, it increases jailbreak success rate from 7.0% to 84.0%, offering the first quantitative analysis of toxicity-bypass relationships and actionable recommendations for classifier hardening.

arxiv arXiv cs.CL · 2d ago

Validation-Gated Mechanistic Analysis of Suicidality Detection in LLMs

A validation-gated framework analyzes an LLM's internal features only after the corresponding behavior has been observed, revealing a mid-network feature that causally contributes to suicidality detection. This feature is semantic, low-rank, cross-model, and specific to suicidality rather than general distress, though steering is necessary but not sufficient. Smaller models encode suicidality but only larger ones act on it, and the evidence is limited to English Reddit text.

arxiv arXiv cs.CL · 2d ago

Study Finds AI Still Fails to Detect Legal Citation Hallucinations

A new study reveals over 1,000 legal filings contain fabricated citations, with the number rising annually. Benchmarking five AI models shows improved performance, with GPT-5 achieving 82.8% recall and 60.5% F1 in agentic settings, though all models struggle with subtle errors and face resource constraints due to limited information access.
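The summary reports recall and F1 but not precision. As a rough sketch, precision can be backed out by inverting the standard F1 formula, assuming both figures describe the same class and setting and that F1 is the usual unweighted harmonic mean (the source does not confirm either). The function name is hypothetical.

```python
def precision_from_recall_f1(recall: float, f1: float) -> float:
    """Invert F1 = 2PR / (P + R) to recover precision P.

    Assumes the standard (beta = 1) F1 definition; the source does not
    state which variant was used.
    """
    denom = 2 * recall - f1
    if denom <= 0:
        raise ValueError("no valid precision: F1 must be below 2 * recall")
    return f1 * recall / denom


# Figures quoted in the summary for GPT-5 in the agentic setting.
precision = precision_from_recall_f1(recall=0.828, f1=0.605)
print(f"implied precision: {precision:.3f}")
```

Under those assumptions the implied precision is a little under 50%, which would mean roughly half of the citations flagged as hallucinated are false alarms.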

arxiv arXiv cs.CL · 2d ago

MedLayXPlain: Benchmarking Expert-Lay Gap in Medical Vision-Language Models

MedLayXPlain introduces the first large-scale benchmark for medical lay language generation, featuring 122,789 region-grounded samples across eight imaging modalities. It evaluates medical vision-language models on expert-lay alignment using a hierarchical ontology system and a lightweight evaluator, revealing a systematic gap: expert-level performance in captioning coexists with significant degradation in lay language, while general-purpose models lack clinical precision.

arxiv arXiv cs.CL · 2d ago

Listenable Interpretable Speaker Embeddings

LISE decomposes speaker embeddings into interpretable components without annotations. Listening experiments show human participants correctly distinguish speakers with 83.9% accuracy, validating the interpretability of the components while preserving ASV performance.

arxiv arXiv cs.CL · 2d ago

Sexualised AI Voices Amplify Gender Power Asymmetries

A study finds that sexualised AI voices on commercial platforms reinforce binary gender norms. Female-coded voices are more often described with submissive, sexualised terms, while male-coded voices are linked to dominance and positive traits, reflecting entrenched gendered power asymmetries.

blog Simon Willison · 2d ago

Prompt Injection as Role Confusion

Researchers identify 'role confusion' as a key vulnerability in LLMs, where models misinterpret user input due to stylistic similarities with internal role tags. Destyling user prompts reduces attack success from 61% to 10%, showing that subtle text style changes can dramatically alter model behavior, even when the content appears identical to humans.

media Latent Space · 3d ago

AI Red Teaming and Prompt Injection Risks Explained

Zico Kolter and Matt Fredrikson, co-authors of the definitive paper on indirect prompt injections and authorities on the Mythos model, discuss the growing risks of AI security. They highlight that AI systems require a distinct security mindset, with agents introducing new vulnerabilities, and that specialized red-teaming AI can outperform humans in breaking models, making AI prompt injection breaches increasingly likely.

lab OpenAI News · 3d ago

OpenAI Launches Daybreak Security Tools

OpenAI has introduced Codex Security and GPT-5.5-Cyber as part of its Daybreak suite. These tools aim to help organizations identify, validate, and patch vulnerabilities at scale.

lab NVIDIA Technical Blog · 3d ago

NVIDIA Launches Halos for Robotics: Full-Stack Functional Safety System

NVIDIA has introduced Halos for Robotics, a full-stack functional safety system designed for physical AI. It enables AI-driven safety in unstructured environments where robots operate autonomously alongside humans in factories, warehouses, hospitals, and homes.

media Hugging Face Forums · 3d ago

LLMs as Epistemic Accelerators: The Risk Is Not Only Hallucination

LLMs do not merely hallucinate; they amplify human epistemic overconfidence by turning weak hypotheses into coherent, polished claims before evidence is verified. This creates a risk of premature certainty in research, policy, and other domains, not because models lie, but because they accelerate human tendencies to favor elegant explanations over uncertainty.

media r/LocalLLaMA · 4d ago

Qwen 3.6 27B Apostate Released with Safety Removed

The Qwen 3.6 27B model has been modified using Apostate to remove safety alignment, reducing its refusal rate from 92% to 7.6%. The reported KL divergence of 0.120 is presented as evidence that the change has minimal impact on the model's capabilities.
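For readers unfamiliar with the metric, here is a minimal sketch of KL divergence between two discrete next-token distributions. The post does not say which direction was measured, which prompts were averaged over, or whether the value is in nats or bits, so the toy distributions and the choice of nats below are illustrative assumptions only.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) in nats for two discrete distributions on the same support."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Terms with p_i == 0 contribute nothing; q_i is floored to avoid log(0).
    return sum(
        pi * math.log(pi / max(qi, eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


# Hypothetical next-token probabilities over a 3-token vocabulary.
original = [0.70, 0.20, 0.10]  # stand-in for the base model
modified = [0.60, 0.25, 0.15]  # stand-in for the modified model

kl = kl_divergence(original, modified)
print(f"KL(original || modified) = {kl:.4f} nats")
```

A value of zero means the two distributions are identical; larger values mean the modified model's output distribution has drifted further from the original's.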

lab Google DeepMind Blog · 4d ago

AI Control Roadmap for Internal System Security

An AI Control Roadmap has been introduced to secure internal systems by integrating traditional safeguards with real-time monitoring capabilities.

media AI News (smol.ai) · 4d ago

GLM-5.2 Emerges as Leading Open-Weight Coding Model

GLM-5.2 is widely regarded as the first open-weight coding model that rivals frontier models like Opus 4.8 and GPT-5.5 in capability. Practitioners highlight its strong tool use, long-horizon planning, and autonomous subagent behavior, with consensus that it now credibly operates in the frontier SWE range. The model's emergence underscores the growing value of open weights for provider competition, on-prem deployment, and reduced vendor lock-in.

media r/LocalLLaMA · 6d ago

Benchmarking or benchmarketing?

LLM benchmarking is increasingly seen as marketing rather than objective measurement. Users ask which benchmarks are genuinely meaningful for local models, as opposed to superficial score-based claims.

media r/LocalLLaMA · 6d ago

Local LLM Censorship Reported on Reddit

Users report that local language models are refusing to answer questions without guardrails, raising concerns about censorship in decentralized AI setups. The issue was shared on Reddit's LocalLLaMA community, where users describe instances of models blocking responses to legitimate queries.

arxiv arXiv cs.AI · 6d ago

NRT-Bench: Multi-turn Red-teaming of LLM Agents in Safety-Critical Systems

NRT-Bench introduces a benchmark for multi-turn red-teaming of LLM agents operating in a simulated nuclear power plant. Across four frontier operator models, 8.7% to 12.1% of attack sessions result in loss of a critical safety function, with vulnerabilities largely disjoint across models. The effectiveness of defences varies significantly by model, showing strong model dependence.

arxiv arXiv cs.AI · 6d ago

Defensive Misdirection Against Automated Attacks on Agentic AI

Agentic AI systems face growing threats from model-guided automated attacks. A new defense strategy, Contextual Misdirection via Progressive Engagement (CMPE), reduces attacker success rates by up to two orders of magnitude and nearly eliminates verified attack success in benchmark tests.

arxiv arXiv cs.AI · 6d ago

Evaluator Bias Propagation in Multi-Agent LLM Systems

Contagion Networks introduces a framework to measure how evaluator biases spread among LLM agents. In a 3-agent experiment, biases propagated consistently with contagion coefficients between 0.157 and 0.352, and homogeneous-model agents showed significantly weaker contagion than cross-model setups. Increasing evaluator committee size from k=1 to k=3 reduced effective contagion by 72.4%.

arxiv arXiv cs.AI · 6d ago

Calibration Without Comprehension in LLM Vulnerability Detection

CWE-Trace evaluates 8 vanilla and 15 LoRA-fine-tuned LLMs on Linux kernel vulnerability detection. Results show that data contamination offers no advantage and that fine-tuning only shifts output thresholds without altering decision policies. Despite improved detection scores, LLMs lack reliable security reasoning, with top-1 CWE accuracy below 1.3% and binary detection performance at 52.1%.