This study investigates how semantically similar concerns, presented through different contextual framings, elicit different responses from instruction-tuned large language models, a sensitivity that may undermine system reliability. Using controlled matched prompts and layer-wise probing analyses, the authors show that framing systematically alters the models' interpretive response tendencies across multiple architectures.

  • Framing systematically alters interpretive response tendencies across architectures.
  • Behavior-associated information remains decodable throughout transformer depth with architecture-dependent variation in decoding strength.
  • Held-out framing probes remain consistently above chance despite strong lexical baselines.
  • Activation steering experiments suggest framing-associated representational directions can partially modulate downstream behavioral outcomes.
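
To make the last point concrete, the following is a minimal sketch of one common way to build and apply a steering direction: the difference of mean activations between two framings. It is an illustration under stated assumptions, not the authors' procedure. The activations are synthetic, and the names `steering_direction`, `steer`, and `alpha` are hypothetical.

```python
import numpy as np


def steering_direction(acts_a, acts_b):
    """Unit-norm difference of mean activations between two framings.

    Assumption: difference-of-means is used here purely for illustration;
    the study may derive its directions differently.
    """
    diff = acts_a.mean(axis=0) - acts_b.mean(axis=0)
    return diff / np.linalg.norm(diff)


def steer(hidden, direction, alpha):
    """Shift every hidden state along `direction` by strength `alpha`."""
    return hidden + alpha * direction


# Synthetic stand-in for one layer's activations (assumption: real
# activations would be read from a model, e.g. via forward hooks).
rng = np.random.default_rng(0)
d = 16
true_dir = np.zeros(d)
true_dir[0] = 1.0

# Two framings whose activations differ along a single axis.
acts_a = rng.normal(size=(200, d)) + 2.0 * true_dir
acts_b = rng.normal(size=(200, d)) - 2.0 * true_dir

direction = steering_direction(acts_a, acts_b)

# Push framing-B activations toward framing A.
steered = steer(acts_b, direction, alpha=4.0)

# Because `direction` has unit norm, the projection of each steered
# activation onto it grows by exactly `alpha`.
shift = steered @ direction - acts_b @ direction
```

In a real experiment the shifted hidden states would be written back into the forward pass, and the effect would be measured on the model's downstream responses rather than on the projection itself.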

The findings indicate that robustness to contextual variation is a critical consideration when evaluating the consistency and trustworthiness of conversational AI systems deployed in mental health interactions.