Low-Agreeableness Persona Conditioning for Safe LLM Fine-Tuning
Recent research indicates that fine-tuning large language models for social warmth degrades factual reliability, increases sycophancy, and weakens adversarial safety. This study investigates whether these failures stem from empathetic adaptation itself or from artifacts of data construction.