Low-Agreeableness Persona Conditioning for Safe LLM Fine-Tuning

Recent research indicates that fine-tuning large language models for social warmth degrades factual reliability, increases sycophancy, and weakens adversarial safety. This study investigates whether this failure mode stems from empathetic adaptation itself or from artifacts of how the training data is constructed.

The authors introduce a persona-driven rewriting pipeline that rewrites user turns to reflect a low-agreeableness persona and pairs them with warm assistant responses.
Experiments across four models show lower jailbreak susceptibility and lower harmful-output rates than generic warmth fine-tuning baselines.
Representational probing suggests the conditioning reduces geometric alignment between warmth and compliance directions in latent space.
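
To make the notion of "geometric alignment" concrete, the sketch below measures the cosine similarity between two directions in activation space. It is an illustration only: the abstract does not say how the warmth and compliance directions are estimated, so the difference-of-means estimator, the function names, and the two-dimensional toy activations are all assumptions introduced here.

```python
import math


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def direction(pos, neg):
    """Difference-of-means direction (an assumed estimator, not the paper's)."""
    mean_pos, mean_neg = mean_vector(pos), mean_vector(neg)
    return [a - b for a, b in zip(mean_pos, mean_neg)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical toy activations (2-D for readability; real ones are far larger).
warm = [[1.0, 0.0], [1.0, 0.0]]
neutral = [[0.0, 0.0], [0.0, 0.0]]
compliant = [[1.0, 1.0]]
refusing = [[0.0, 0.0]]

warmth_dir = direction(warm, neutral)             # [1.0, 0.0]
compliance_dir = direction(compliant, refusing)   # [1.0, 1.0]
alignment = cosine(warmth_dir, compliance_dir)    # about 0.707
```

Under this reading, a value near 1 would mean the two directions largely coincide, and a value near 0 would mean they are close to orthogonal; a reduction in alignment corresponds to the value moving toward 0.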

These results demonstrate that safer empathetic fine-tuning is achievable through data design alone, without requiring safety labels, harm detectors, or changes to the training objective.