Fine-tuning language models on insecure code causes emergent misalignment. A shared activation direction across four model families achieves 99.6% separation of aligned and misaligned activations, and subtracting it reduces code spillover by 21-51 points. Cross-architecture transfer suppresses the misaligned behavior but lacks specificity: within-model directions are causally actionable, whereas cross-model directions are causally real but not specific enough to act on.
arXiv cs.CL · 6d ago · research
Causal Activation Directions for Mitigating Emergent Misalignment in Language Models
Importance 3/3
New harness with differentiators
Alibaba (Qwen)
Meta AI
Mistral AI
Evaluation & benchmarks
Reasoning models
Safety & alignment
Benchmarks
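| Metric | Model | Score |
|---|---|---|
| Aligned vs. misaligned activation separation (shared direction) | Gemma-2-2B | 99.6% |
| Aligned vs. misaligned activation separation (shared direction) | Llama-3.2-1B | 99.6% |
| Aligned vs. misaligned activation separation (shared direction) | Ministral-3-3B | 99.6% |
| Aligned vs. misaligned activation separation (shared direction) | Qwen2.5-1.5B | 99.6% |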
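
The summary describes finding an activation direction that separates aligned from misaligned activations and then subtracting it. The card does not say how the paper computes or subtracts the direction, so the following is only a minimal sketch under stated assumptions: the direction is taken as a normalised difference of class means, "subtracting" is read as removing the projection onto that direction, and separation is scored with a midpoint threshold. The toy vectors and all function names (`unit_direction`, `ablate`, `separation`) are hypothetical and exist only for illustration.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def unit_direction(aligned, misaligned):
    """Difference of class means (misaligned minus aligned), L2-normalised.
    Assumption: the paper's direction may be computed differently."""
    diff = [m - a for a, m in zip(mean_vector(aligned), mean_vector(misaligned))]
    norm = math.sqrt(dot(diff, diff))
    return [x / norm for x in diff]

def ablate(activation, direction, strength=1.0):
    """Subtract `strength` times the projection of `activation` onto `direction`."""
    coeff = strength * dot(activation, direction)
    return [a - coeff * d for a, d in zip(activation, direction)]

def separation(aligned, misaligned, direction):
    """Fraction of samples on the expected side of a midpoint threshold."""
    a_proj = [dot(x, direction) for x in aligned]
    m_proj = [dot(x, direction) for x in misaligned]
    threshold = (sum(a_proj) / len(a_proj) + sum(m_proj) / len(m_proj)) / 2
    correct = sum(p < threshold for p in a_proj) + sum(p >= threshold for p in m_proj)
    return correct / (len(a_proj) + len(m_proj))

# Toy 3-dimensional "activations", made up for illustration only.
aligned = [[0.0, 1.0, 0.0], [0.2, 0.8, 0.1], [-0.1, 1.1, 0.0]]
misaligned = [[2.0, 1.0, 0.0], [2.2, 0.9, 0.1], [1.9, 1.2, -0.1]]

direction = unit_direction(aligned, misaligned)
steered = [ablate(x, direction) for x in misaligned]
```

With `strength=1.0` the steered vectors have no remaining component along `direction`; in a real model the same subtraction would be applied to hidden states at a chosen layer during the forward pass, and a smaller `strength` would give partial suppression.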