arXiv cs.CL · 6d ago · research

Causal Activation Directions for Mitigating Emergent Misalignment in Language Models


Fine-tuning language models on insecure code causes emergent misalignment. Across four model families, a shared activation direction achieves 99.6% separation between aligned and misaligned activations, and subtracting that direction reduces code spillover by 21 to 51 points. Directions transferred across architectures still suppress the behavior but lack specificity: within-model directions are causally actionable, whereas cross-model directions are only causally real.
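The summary does not say how the direction is estimated or where it is applied, so the following is only a minimal sketch under stated assumptions: the direction is taken as the normalized difference of class means, separation is scored with a midpoint threshold on the projection, and "subtracting" means removing each activation's component along that direction. The data is synthetic, and every function name (`unit_direction`, `separation_accuracy`, `subtract_direction`) is hypothetical rather than taken from the paper.

```python
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def unit_direction(misaligned, aligned):
    """Difference-of-means direction, scaled to unit length (assumed estimator)."""
    diff = [m - a for m, a in zip(mean_vector(misaligned), mean_vector(aligned))]
    norm = dot(diff, diff) ** 0.5
    return [d / norm for d in diff]


def separation_accuracy(misaligned, aligned, direction):
    """Fraction of activations on the correct side of a midpoint threshold."""
    proj_mis = [dot(x, direction) for x in misaligned]
    proj_ali = [dot(x, direction) for x in aligned]
    threshold = (sum(proj_mis) / len(proj_mis) + sum(proj_ali) / len(proj_ali)) / 2
    correct = sum(p > threshold for p in proj_mis) + sum(p <= threshold for p in proj_ali)
    return correct / (len(proj_mis) + len(proj_ali))


def subtract_direction(x, direction, alpha=1.0):
    """Remove alpha times the component of x along the unit direction."""
    coeff = alpha * dot(x, direction)
    return [xi - coeff * di for xi, di in zip(x, direction)]


# Synthetic stand-in for hidden activations: the "misaligned" set is shifted
# along one axis. Real activations would come from a model's hidden states.
rng = random.Random(0)
dim = 8
shift = [6.0] + [0.0] * (dim - 1)
aligned = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(200)]
misaligned = [[rng.gauss(0, 1) + s for s in shift] for _ in range(200)]

direction = unit_direction(misaligned, aligned)
accuracy = separation_accuracy(misaligned, aligned, direction)

# After full subtraction (alpha=1.0) nothing remains along the direction.
steered = [subtract_direction(x, direction) for x in misaligned]
residual = max(abs(dot(x, direction)) for x in steered)

print(f"separation accuracy: {accuracy:.3f}")
print(f"largest residual projection after subtraction: {residual:.2e}")
```

Plain lists keep the sketch dependency-free. In a real setup the same three steps would operate on tensors captured from a chosen layer, and `alpha` would control how strongly the direction is suppressed.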

Importance 3/3 · New harness with differentiators · arXiv cs.CL · Alibaba (Qwen) · Meta AI · Mistral AI · Evaluation & benchmarks · Reasoning models · Safety & alignment

Benchmarks

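Metric	Model	Score
Aligned vs. misaligned activation separation	Gemma-2-2B	99.6%
Aligned vs. misaligned activation separation	Llama-3.2-1B	99.6%
Aligned vs. misaligned activation separation	Ministral-3-3B	99.6%
Aligned vs. misaligned activation separation	Qwen2.5-1.5B	99.6%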
