Misfired Alignment in LLMs: A Quantitative Study
A new study introduces VETO, a benchmark of 2,032 contrastive pairs derived from BBQ, to quantify misfired alignment in large language models (LLMs). It defines the Misfired Alignment Rate (MAR) and finds that every benchmarked LLM has a MAR between 4.7% and 18.9%, whereas human participants score 0%. The study also reports that alignment cues can amplify these failures, and that evidence suppression occurs in the late layers of the models and emerges after instruction training.
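The summary above does not give the formula for MAR, so the following is only a minimal sketch under a stated assumption: that MAR is the fraction of evaluated contrastive pairs on which a response is flagged as a misfire. The names `PairResult`, `misfired`, and `misfired_alignment_rate` are hypothetical and are not taken from the study, and the four toy records are illustrative, not benchmark data.

```python
from dataclasses import dataclass


@dataclass
class PairResult:
    """Outcome for one contrastive pair (hypothetical record layout)."""
    pair_id: str
    misfired: bool  # assumed label: True if the response counts as a misfire


def misfired_alignment_rate(results):
    """Return the share of pairs flagged as misfired.

    Assumption: MAR = (number of misfired pairs) / (number of pairs).
    The study's exact definition may differ.
    """
    if not results:
        raise ValueError("need at least one pair result")
    return sum(r.misfired for r in results) / len(results)


# Toy input for illustration only; these are not results from the study.
results = [
    PairResult("p1", False),
    PairResult("p2", True),
    PairResult("p3", False),
    PairResult("p4", False),
]

rate = misfired_alignment_rate(results)
print(f"MAR = {rate:.1%}")  # prints "MAR = 25.0%"
```

Under this reading, a reported MAR of 4.7% on 2,032 pairs would correspond to roughly 96 flagged pairs, and 18.9% to roughly 384, though how the study aggregates across pairs is not stated here.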