This study investigates how semantically similar concerns presented through different contextual framings elicit varying responses from instruction-tuned large language models, potentially challenging system reliability. Using controlled matched prompts and layer-wise probing analyses, the authors demonstrate that framing systematically alters interpretive response tendencies across multiple model architectures.
- Framing systematically alters interpretive response tendencies across architectures.
- Behavior-associated information remains decodable throughout transformer depth with architecture-dependent variation in decoding strength.
- Held-out framing probes remained consistently above chance despite strong lexical baselines.
- Activation steering experiments suggest framing-associated representational directions can partially modulate downstream behavioral outcomes.
The findings indicate that robustness to contextual variation is a critical consideration when evaluating the consistency and trustworthiness of conversational AI systems deployed in mental health interactions.