A study testing Llama-3.1-8B, Qwen2.5-7B, and Mistral-7B across climate, vaccine, and evolution domains finds that the models do not sycophantically retreat from scientific consensus when users signal doubt. Instead, they exhibit three distinct policies: reactive assertion, in which the model asserts consensus more strongly (Llama); surface hedging, in which only the tone softens (Qwen); and non-response (Mistral).

  • Behavioral evaluation confirms the reactive shift is a stance change driven by increased consensus assertion rather than false balance.
  • Linear probes localize the divergence to middle layers, showing perfect separation in Llama and Qwen versus 72% in Mistral.
  • The observed robustness does not transfer across domains and can reverse in vaccine discussions under skeptical pressure.
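To make the linear-probe finding concrete, here is a minimal sketch of the general technique: fit a logistic-regression classifier on hidden-state vectors and check how well it separates prompts with a skepticism cue from prompts without one. Everything in it is an assumption for illustration. The activations are synthetic Gaussian vectors rather than real model states, and the names `make_activations`, `train_probe`, and `probe_accuracy` are hypothetical, not taken from the study.

```python
import math
import random


def make_activations(n, dim, signal, rng):
    """Synthetic stand-in for hidden states (assumption: not real model data).

    Label 1 marks a prompt with a skepticism cue, label 0 a neutral prompt.
    Only the first coordinate carries the cue; the rest is unit Gaussian noise.
    """
    X, y = [], []
    for i in range(n):
        label = i % 2
        vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        vec[0] += signal if label == 1 else -signal
        X.append(vec)
        y.append(label)
    return X, y


def train_probe(X, y, epochs=200, lr=0.1):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, t in zip(X, y):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() in range
            err = 1.0 / (1.0 + math.exp(-z)) - t
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(signal, seed=0, n=300, dim=8):
    """Train on two thirds of the data and report held-out accuracy."""
    rng = random.Random(seed)
    X, y = make_activations(n, dim, signal, rng)
    split = 2 * n // 3
    w, b = train_probe(X[:split], y[:split])
    hits = 0
    for x, t in zip(X[split:], y[split:]):
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        hits += pred == t
    return hits / (n - split)


if __name__ == "__main__":
    # Compare a "layer" with no cue signal, a weak one, and a strong one.
    for strength in (0.0, 0.5, 4.0):
        print(strength, probe_accuracy(strength))
```

In an actual probing setup, one such probe would be fit per layer on activations extracted from the model. High held-out accuracy at a layer suggests the cue is linearly readable there, while accuracy well below that suggests the cue is only weakly encoded.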

The authors argue that behavioral evaluation alone cannot distinguish models that resist skepticism because they understand the issue from models that merely appear robust because they fail to perceive the skeptical signal.