Safety & alignment — korshunov.ai

Safety & alignment Page 1 / 10

Elias in the Lighthouse: Diagnosing Low Diversity in LLM Stories

A new study examines the limited diversity in stories generated by large language models, using the recurring character Elias in the lighthouse as a case study. The research highlights how such patterns suggest systemic biases in training data and model outputs.

arxiv arXiv cs.LG · 8d ago

LegalHalluLens: Auditing Hallucinations in Legal AI

LegalHalluLens introduces a framework to audit AI hallucinations in legal contexts by analyzing typed hallucination profiles across four claim categories. It reveals a 38-40 point gap between obligation/numeric and temporal claims, and shows two systems with identical 52% hallucination rates can have opposite risk directions. The framework uses a Risk Direction Index and calibrated debate pipelines to reduce fabricated detections by 45%, offering actionable diagnostics for trustworthy legal AI deployment.

arxiv arXiv cs.LG · 8d ago

ScaFE: Using LLMs to Extract Clinically Meaningful Scar Features

ScaFE repositions large language models as feature engineers for scar classification, generating executable Python code from clinical criteria to extract interpretable features. The framework achieves superior performance with limited data, preserves privacy by processing images locally, and produces clinically grounded features aligned with established scoring systems like the Vancouver Scar Scale.

arxiv arXiv cs.LG · 8d ago

Edge Flow: A Continuous-Time Model for Gradient Descent at Edge of Stability

Edge Flow is a tractable, predictive continuous-time model that captures gradient descent dynamics at the edge of stability. It decomposes dynamics into center, oscillation direction, and magnitude, with self-stabilization of sharpness emerging from coupled feedback. The model requires only two gradient evaluations and one Hessian-vector product per iteration and outperforms prior models in tracking oscillations and explaining instabilities at EoS.

arxiv arXiv cs.LG · 8d ago

Handlebars Triple-Brace Injection Exploits Structural Role Delimiters

Handlebars' triple-brace interpolation fails to protect against structural role injection, as HTML escaping only neutralizes angle-bracket delimiters. It leaves colon and Markdown hash delimiters intact, enabling attackers to hijack model behavior. The default escaping provides no protection for most role delimiter schemes and cannot replace a clear separation of instructions and data.

arxiv arXiv cs.CL · 8d ago

LegalHalluLens: Auditing Hallucinations in Legal AI

LegalHalluLens introduces a framework to audit AI hallucinations in legal contexts by analyzing typed hallucination profiles across four claim categories. It reveals a 38-40 point gap between obligation/numeric and temporal claims, and shows two systems with identical 52% hallucination rates can have opposite risk directions. The framework uses a Risk Direction Index and calibrated debate pipelines to reduce fabricated detections by 45% and improve accountability in legal AI deployment.

arxiv arXiv cs.CL · 8d ago

AI's Synthetic Lived Experience in Caregiver Support

LLMs can generate peer-like responses that mimic personal narratives, creating a false impression of lived experience. Psycholinguistic analysis shows human peers use more first-person and past-focused language than AI, and AI often fabricates experiential grounding without real experience. This synthetic lived experience paradox risks misleading caregivers, necessitating mechanisms to distinguish supportive framing from fabricated experience.

arxiv arXiv cs.CL · 8d ago

PseudoBench: Benchmarking Agentic Auto-Research Resistance to Pseudoscience

PseudoBench evaluates agentic auto-research systems' ability to detect pseudoscientific claims. Testing seven state-of-the-art agents, it finds near-zero refusal rates and only 27.4% resistance to pseudoscientific narratives, with stronger agents often using sophisticated scientific language to mask pseudoscience.

arxiv arXiv cs.CL · 8d ago

Security and Privacy Prompts in User-LLM Conversations

A study of 14,727 security and privacy prompts from 3.2M real-world user-LLM conversations identifies nine categories of S&P queries. Commercial LLMs outperform open models, with GPT 5.5 providing good responses on 98% of prompts versus Llama 4 at 47%, though some commercial models produce contradictory responses across runs.

arxiv arXiv cs.CL · 8d ago

Handlebars Triple-Brace Injection Exploits Structural Role Delimiters

Handlebars' triple-brace interpolation fails to protect against structural role injection, as HTML escaping only neutralizes angle-bracket delimiters. It leaves colon and Markdown hash delimiters intact, enabling attackers to hijack model turns. The default escaping provides no protection for most role delimiter families and cannot replace a structural separation of instructions and data.

arxiv arXiv cs.CL · 8d ago

Geographic Bias in Large Language Models from User Metadata

A study reveals that even neutral prompts trigger region-specific responses in large language models due to user metadata. Location leakage increases by up to 793 times in some models, and using 'Unknown' instead of location metadata still causes significant bias, indicating the user profile frame itself acts as a conditioning signal.

arxiv arXiv cs.CL · 8d ago

Agentic Benchmark Reveals AI Models Fail to Avoid Animal Exploitation

TAC, the first agentic benchmark for implicit animal welfare, tests AI agents' ability to avoid animal exploitation in travel booking scenarios. All seven frontier models score below 64%, with the best at 53%, and even minor prompt improvements yield only modest gains. An audit finds no signs of evaluation awareness, indicating performance gaps stem from lack of true welfare reasoning, not prompt recognition.

arxiv arXiv cs.CL · 8d ago

Red-Team Study Finds Frontier LLMs Remain Vulnerable to Automated Attacks

A red-team study of Anthropic's Fable 5 and Opus 4.8 models reveals both are vulnerable to adaptive iterative attacks, with Opus 4.8 breached on 11.5% of intents and Fable 5 on 6.1%. Despite robust defenses, both models generated 1,620 and 702 panel-confirmed harmful completions across all harm categories, automatically and efficiently under automated attack.

arxiv arXiv cs.LG · 8d ago

Fairness in Graph Neural Networks via Laplacian Adaptation

A new framework modifies the Laplacian operator in graph diffusion to enhance fairness by incorporating subspace projections, spectral adjustments, and frequency-based filtering. The method leverages graph diffusion's smoothing properties to mitigate bias, with theoretical analysis and empirical validation on synthetic and real-world datasets showing improved fairness without significant computational overhead.

arxiv arXiv cs.LG · 8d ago

Vision-language models don't always need images for chest X-ray accuracy

A causal audit shows that many vision-language models achieve high chest radiograph accuracy without using images. Text-only models match multimodal models in performance and outperform them in grounding, with accuracy and confidence flags only appearing when image use occurs. These findings suggest that accuracy alone is insufficient to validate clinical deployment, and grounding must be assessed.

arxiv arXiv cs.LG · 8d ago

SMAA-Fair: A Fairness-Aware Extension of SMAA for Ranking

SMAA-Fair extends Stochastic Multicriteria Acceptability Analysis by reweighting rankings based on group fairness. It incorporates fairness metrics like Statistical Parity, rKL, and nDKL, adjusting acceptability indices to better represent protected groups while maintaining robustness to preference uncertainty.

arxiv arXiv cs.LG · 8d ago

No-Free-Fairness: Fundamental Limits in Learning Systems

The paper introduces 'No-Free-Fairness' theorems that prove three fundamental limits in learning systems. These include inherent fairness-cost trade-offs, unavoidable subgroup disparity in finite samples, and model expressivity constraints that prevent fairness regardless of data. The results show fairness is constrained by problem structure, data limits, and model capacity, not just biased data.

arxiv arXiv cs.LG · 8d ago

LLM Belief Stabilization via Prompted Predictive Resampling

Large language models exhibit early belief drift in multiple-choice question answering, violating the martingale property. Prompted predictive resampling (PPR) reveals this drift, which self-stabilizes after sufficient resampling, leading to coherent predictive distributions. We propose a seed-answer prompting strategy and a self-consistency loss to accelerate stabilization and reduce drift, improving predictive coherence without affecting accuracy.

arxiv arXiv cs.LG · 8d ago

AnchorKV: Safety-Aware KV Cache Compression with Refusal Anchor

AnchorKV introduces a soft penalty mechanism to bias KV cache token retention away from harmful prompt directions. It uses a layer-specific key projection space anchor derived from representation engineering to improve safety alignment without sacrificing much utility, offering a drop-in solution that enhances defense against jailbreak attacks.

arxiv arXiv cs.LG · 8d ago

Differential Privacy in Gaussian Process Posterior Sampling

Gaussian process posterior sampling inherently provides differential privacy due to its intrinsic randomness. Explicit Rényi-DP bounds show that privacy depends on ridge regularisation, with membership-inference attacks confirming the predicted leakage patterns. Adding calibrated GP noise enhances privacy while maintaining utility in downstream tasks.