Auditing Framing-Sensitive Behavioral Instability in LLMs for Mental Health

This study investigates how semantically similar concerns presented through different contextual framings elicit varying responses from instruction-tuned large language models, potentially challenging system reliability. Using controlled matched prompts and layer-wise probing analyses, the authors demonstrate that framing systematically alters interpretive response tendencies across multiple model architectures.

Framing systematically alters interpretive response tendencies across architectures.
Behavior-associated information remains decodable throughout transformer depth with architecture-dependent variation in decoding strength.
Held-out framing probes remained consistently above chance despite strong lexical baselines.
Activation steering experiments suggest framing-associated representational directions can partially modulate downstream behavioral outcomes.

The findings indicate that robustness to contextual variation is a critical consideration when evaluating the consistency and trustworthiness of conversational AI systems deployed in mental health interactions.