Safety & alignment — korshunov.ai


CEAP Reduces Variance in LLM Circuit Discovery

CEAP, a new circuit discovery method, substantially reduces resampling variance compared to EAP-IG. The paper shows that rephrasing variance arises from prompt templates activating different circuits, suggesting LLMs are inherently hard to steer across diverse inputs. Sample-wise variance is largely benign, as poor faithfulness scores result from selective contribution scaling, not circuit defects.

arxiv arXiv cs.LG · 9d ago

Causal Framework for Auditing Synthetic Data Disclosures

A model-agnostic auditing framework detects and distinguishes true and phantom disclosures in synthetic data. It uses only synthetic outputs and a held-out control set to perform statistical testing, offering tighter privacy leakage bounds than prior methods without requiring model access or additional training.

arxiv arXiv cs.LG · 9d ago

Neural EXposure Interaction Search for Interpretable HTE

NEXIS identifies causal heterogeneous treatment effects by discovering Markov blankets in pre-treatment data. It leverages multi-modal, multi-view measurements and scalable representations with minimal human input, enabling interpretable and actionable policy insights from controlled experiments.

arxiv arXiv cs.LG · 9d ago

DP-FL Backdoor Attacks: RING Exploits Privacy for Malicious Signals

A new attack, RING, exploits differential privacy in federated learning to conceal backdoor signals while maximizing impact. It achieves 90.3% attack success against state-of-the-art defenses, up to 26.08x over baseline methods, and reveals a critical security gap in DP-FL due to inherent masking of malicious updates.

media r/LocalLLaMA · 10d ago

HalBench Tests 29 Open Source Models on Sycophancy and Hallucination

HalBench evaluates 29 open-source LLMs on a custom benchmark for sycophancy and hallucination. Qwen 3.6 and Gemma 4 outperform larger models, with Qwen 3.6 achieving 36.6% pushback, higher than GPT-5.4 and Gemini 3.1 Pro. Model size does not correlate with honest responses, indicating that architecture and training data matter more than parameter count.

blog Simon Willison · 10d ago

Cloudflare CAPTCHA triggered only for searches with ampersand

Simon Willison configured Cloudflare's CAPTCHA to activate only for search queries containing at least one ampersand. The rule uses a custom filter: (http.request.uri.path wildcard r"/search/*" and http.request.uri.query contains "&"). This allows simple searches like /search/?q=lemur to pass without CAPTCHA.
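
The logic of that filter can be illustrated outside Cloudflare with a minimal Python sketch. This is not Cloudflare's rule engine: the function name `would_challenge` is hypothetical, the `wildcard` operator is approximated with `fnmatch`, and lowercasing the path assumes that the operator matches case-insensitively.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


def would_challenge(url: str) -> bool:
    """Approximate the quoted rule: path matches /search/* AND query contains '&'.

    Hypothetical helper for illustration only; it does not call Cloudflare.
    """
    parts = urlsplit(url)
    # Assumption: `wildcard` is case-insensitive, so compare a lowercased path.
    path_matches = fnmatchcase(parts.path.lower(), "/search/*")
    # `contains "&"` is a plain substring test on the raw query string.
    return path_matches and "&" in parts.query


if __name__ == "__main__":
    for url in ("/search/?q=lemur",
                "/search/?q=lemur&tag=python",
                "/about/?a=1&b=2"):
        print(url, would_challenge(url))
```

Under these assumptions, a single-parameter search such as /search/?q=lemur is not challenged, while any /search/ request whose query string joins several parameters with an ampersand is.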