A study investigating authority bias in language models reveals that systems systematically prioritize social cues from authority figures over factual consistency. Using a controlled medical QA setting with Llama-3.1-8B, Qwen3-8B, and Gemma-2-9B, researchers found that models respond proportionally to perceived authority.

  • Logit lens analysis and probing localize the effect to a critical late layer where correct answer representations are actively erased.
  • This erasure scales with authority level and resists mean vector intervention.
  • The phenomenon is only partially reversible through chain-of-thought reasoning.

The findings suggest that authority-induced sycophancy is not a surface-level output bias but mechanistic knowledge erasure, representing a precise overwriting of correct internal representations by high-status signals.