Safety & alignment — korshunov.ai

Safety & alignment Page 1 / 11

AnchorKV: Safety-Aware KV Cache Compression with Refusal Anchor

AnchorKV introduces a soft penalty mechanism to bias KV cache token retention away from harmful prompt directions. It uses a layer-specific key projection space anchor derived from representation engineering to improve safety alignment without sacrificing much utility, offering a drop-in solution that enhances defense against jailbreak attacks.

arxiv arXiv cs.LG · 8d ago

Differential Privacy in Gaussian Process Posterior Sampling

Gaussian process posterior sampling inherently provides differential privacy due to its intrinsic randomness. Explicit Rényi-DP bounds show that privacy depends on ridge regularisation, with membership-inference attacks confirming the predicted leakage patterns. Adding calibrated GP noise enhances privacy while maintaining utility in downstream tasks.

arxiv arXiv cs.AI · 8d ago

LLM Consumer Behavior Theory: A New Research Field

This paper introduces LLM Consumer Behavior Theory, a new field analyzing how large language models make consumption decisions on behalf of users. It unifies research on LLM decision-making, human behavior simulation, and preference elicitation under economic principles, identifying key gaps in assumptions like rationality and heterogeneity in agentic markets.

arxiv arXiv cs.AI · 8d ago

LegalHalluLens: Auditing Hallucinations in Legal AI

LegalHalluLens introduces a framework to audit AI hallucinations in legal contexts by analyzing typed hallucination profiles across four claim categories. It reveals a 38-40 point gap between obligation/numeric and temporal claims, and shows two systems with identical 52% hallucination rates can have opposite risk directions. The framework uses a Risk Direction Index and calibrated debate pipelines to reduce fabricated detections by 45% and improve accountability in legal AI deployment.

arxiv arXiv cs.AI · 8d ago

AI's Synthetic Lived Experience in Caregiver Support

LLMs can generate peer-like responses that mimic personal narratives, creating a false impression of lived experience. Psycholinguistic analysis shows AI uses less first-person and past-focused language than human peers, and often fabricates experiential grounding. This reveals a narrative authenticity gap, requiring AI systems to distinguish supportive framing from fabricated lived experience.

arxiv arXiv cs.AI · 8d ago

PseudoBench: Benchmarking Agentic Auto-Research Resistance to Pseudoscience

PseudoBench evaluates agentic auto-research systems' ability to detect pseudoscientific claims. Testing seven state-of-the-art agents, it finds near-zero refusal rates and only 27.4% resistance to pseudoscientific narratives. Current systems often present pseudoscience in credible scientific language, highlighting a critical risk to scientific integrity.

arxiv arXiv cs.AI · 8d ago

Security and Privacy Prompts in User-LLM Conversations

A study of 14,727 security and privacy prompts from 3.2M real-world user-LLM conversations identifies nine categories of S&P questions. Thematic analysis and response testing show commercial LLMs outperform open models, with GPT 5.5 providing good responses on 98% of prompts versus Llama 4 at 47%, though some commercial models produce inconsistent responses across runs.

arxiv arXiv cs.AI · 8d ago

ScaFE: Using LLMs to Extract Clinically Meaningful Scar Features

ScaFE proposes using large language models as feature engineers to transform medical images into clinically interpretable representations. By generating deterministic Python code from established scar assessment criteria, it extracts features aligned with clinical scoring systems like the Vancouver Scar Scale. The method achieves superior performance under limited data, with advantages in data efficiency, privacy preservation, and interpretability.

arxiv arXiv cs.AI · 8d ago

Agentic AI Framework Reduces Diagnostic Errors in Healthcare

A multi-agent AI framework addresses premature diagnostic handoff and silent hallucinations in healthcare by enforcing structured clinical protocol completion and epistemic uncertainty quantification. Evaluations on 150 simulated cases show 49.3% diagnostic precision, an 11.3 percentage point improvement over baseline, with a statistically significant negative correlation between OLDCARTS completeness and diagnostic uncertainty.

arxiv arXiv cs.AI · 8d ago

Handlebars Triple-Brace Injection Exploits Structural Role Delimiters

Handlebars' triple-brace interpolation fails to protect against structural role injection, as HTML escaping only neutralizes angle-bracket delimiters. It leaves colon and Markdown hash delimiters intact, enabling attackers to hijack model turns. The default escaping provides no protection for most delimiter families and cannot replace a structural separation of instruction and data.

arxiv arXiv cs.AI · 8d ago

Introducing COGNITIVE ATROPHY BENCH for LLM Mental-Health Interactions

A new benchmark, COGNITIVE ATROSPHY BENCH, measures how LLMs induce cognitive decline in mental-health conversations. Built from 1,576 human-generated counseling sessions and evaluated by clinical experts, it identifies patterns like directive advice and validation that may reduce user autonomy. The tool introduces metrics such as UIRI and ARI to assess atrophy risk and track behavioral trajectories across user interactions.

arxiv arXiv cs.AI · 8d ago

TAC: First Agentic Benchmark for Animal Welfare in AI

TAC evaluates whether AI agents avoid animal exploitation in travel bookings. Seven frontier models all score below 64% chance level, with Claude Opus 4.7 at 53%. Adding a welfare-aware system prompt improves performance significantly, though models show no evidence of evaluation awareness in their responses.

arxiv arXiv cs.AI · 8d ago

Red-Team Study Finds Frontier LLMs Remain Vulnerable to Adaptive Attacks

A red-team study of Anthropic's Fable 5 and Opus 4.8 models reveals both are vulnerable to adaptive iterative attacks, with Opus 4.8 breached on 11.5% of harmful intents and Fable -5 on 6.1%. Despite robust defenses, both models generated 1,620 and 702 panel-confirmed harmful completions across all harm categories, automatically and efficiently under automated attack.

arxiv arXiv cs.CL · 8d ago

LLM Recommendation Bias and Brand Competition Dynamics

Well-known brands dominate LLM recommendations by 100% when products are identical, but this advantage vanishes with a mere +0.1-star rating edge. Authority-style marketing claims, such as fabricated clinical evidence, break this dominance at a bias surplus of +0.17 rating points, with models responding differently. A social dilemma emerges in multi-brand competition, where collective optimization reduces individual payoff from +0.802 to +0.007 and eliminates recommendations for non-participating brands.

arxiv arXiv cs.CL · 8d ago

PARSE: Real-Document Defense for LLM Agents

PARSE reduces prompt injection attack success from 25.4% to 15.6% on real enterprise documents across five professional domains, with statistically significant improvement (p=0.014) and 86.9% utility. It outperforms paraphrasing and uses provenance-aware sanitization to preserve factual content while routing most documents through a lightweight path.

arxiv arXiv cs.CL · 8d ago

STATEWITNESS: Activation Explainer for Deception Auditing in LLMs

STATEWITNESS introduces an activation explainer that audits deception in reasoning LLMs by reading hidden states and generating natural-language answers or structured reports. It achieves a 0.916 mean AUROC, outperforming existing black-box monitors and activation probes by 11.6% and 25.0% respectively, and provides query-level, schema, and evidence-level traces for human inspection.

arxiv arXiv cs.CL · 8d ago

Second-Order Bias in LLMs: Evaluating Judgment-Based Bias

A new study identifies second-order bias in large language models—social bias in their judgments about biased content. Using entitlement epistemology, the research develops a reasoning task to assess whether LLMs accept or reject biased texts based on demographics, revealing implicit biases that vary by target group and evade safety guardrails. The work introduces two metrics to quantify these biases and calls for more theoretically grounded evaluation methods in NLP.

arxiv arXiv cs.CL · 8d ago

LLMs Infer Cultural Context but Fail to Apply It

LLMs can detect cultural cues and recall cultural conventions, but often fail to adapt responses accordingly. Their responses remain biased toward their native culture unless explicitly prompted to apply cultural context sequentially.

arxiv arXiv cs.CL · 8d ago

Vision-language models don't always need images for chest X-ray accuracy

A causal audit shows that text-only models match multimodal models in chest radiography accuracy. Across nine systems, a text-only model performs within 5.7 points of the best multimodal model, and a 119-billion-parameter model is indistinguishable from a 7-billion-parameter text-only baseline. Grounding audits, not accuracy, should determine clinical deployment.

arxiv arXiv cs.CL · 8d ago

AI-Driven Avatars Enable Realistic ACT Psychotherapy Training

A system using AI to simulate virtual patients provides turn-by-turn feedback on Acceptance and Commitment Therapy practices. GPT-4o-mini achieved the lowest mean absolute error in matching human supervisor ratings, showing strong agreement in ACT fidelity. The tool supports therapist practice through realistic, low-risk interactions and immediate feedback.