Researchers propose Psy-CoT, a psychology-grounded chain-of-thought framework that decomposes pre-response reasoning into Interaction Perception, Psychological Empathy, and Logical Construction to improve character fidelity. To address gradient misalignment in reinforcement learning, they introduce Role-Aware Policy Optimization (RAPO), which uses profile-token mutual information to weight gradients asymmetrically.
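
One way to picture the three-stage decomposition is as a prompt scaffold that asks the model to work through the stages in order before it answers. The sketch below is illustrative only: the function name `build_psy_cot_prompt`, the prompt wording, and the layout are assumptions, not the authors' template. Only the three stage names come from the summary.

```python
from typing import List

# Stage names as given in the summary; everything else here is hypothetical.
PSY_COT_STEPS: List[str] = [
    "Interaction Perception",
    "Psychological Empathy",
    "Logical Construction",
]


def build_psy_cot_prompt(profile: str, dialogue: str) -> str:
    """Assemble a prompt that requests the three reasoning stages in order.

    This is an assumed layout for illustration, not the paper's prompt.
    """
    lines = [
        "Character profile:",
        profile.strip(),
        "",
        "Dialogue so far:",
        dialogue.strip(),
        "",
        "Before replying, reason through each step in order:",
    ]
    # One numbered slot per stage, left empty for the model to fill in.
    for index, step in enumerate(PSY_COT_STEPS, start=1):
        lines.append(f"{index}. {step}:")
    lines.append("")
    lines.append("In-character response:")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_psy_cot_prompt("A stoic ship captain.", "User: Are we lost?"))
```

Keeping the stages as numbered, empty slots makes it straightforward to check afterwards that a generated trace actually contains all three stages before the final reply.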

  • Psy-CoT's three reasoning steps force the model to reason dynamically from the character profile rather than mimic surface patterns.
  • RAPO amplifies role-specific tokens under positive advantage and attenuates them under negative advantage to prevent reward hacking.
  • Experiments on CoSER, CharacterBench, and CharacterEval show Psy-CoT outperforms existing role-playing CoT methods.
  • RAPO consistently surpasses GRPO across multiple model scales in the reported evaluations.
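
The asymmetric weighting described for RAPO can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' formulation: the per-token scores are assumed to be profile-token mutual-information estimates already normalized to [0, 1], the linear form `1 ± alpha * score` and the parameter `alpha` are hypothetical, and the loss is a plain policy-gradient surrogate without the clipping or group normalization that GRPO-style objectives use.

```python
from typing import List


def rapo_token_weights(mi_scores: List[float], advantage: float,
                       alpha: float = 0.5) -> List[float]:
    """Hypothetical asymmetric weights: amplify role-specific tokens when the
    advantage is positive, attenuate them when it is negative.

    mi_scores: assumed per-token profile-token MI estimates in [0, 1].
    alpha: assumed strength in [0, 1], so weights stay non-negative.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    weights = []
    for score in mi_scores:
        score = min(max(score, 0.0), 1.0)  # clamp to the assumed range
        if advantage > 0:
            weights.append(1.0 + alpha * score)  # amplify
        elif advantage < 0:
            weights.append(1.0 - alpha * score)  # attenuate
        else:
            weights.append(1.0)  # zero advantage contributes no gradient anyway
    return weights


def weighted_policy_loss(logprobs: List[float], mi_scores: List[float],
                         advantage: float, alpha: float = 0.5) -> float:
    """Token-averaged surrogate loss: -(1/T) * sum_t w_t * A * log pi(token_t)."""
    if len(logprobs) != len(mi_scores) or not logprobs:
        raise ValueError("logprobs and mi_scores must be non-empty and equal length")
    weights = rapo_token_weights(mi_scores, advantage, alpha)
    total = sum(w * advantage * lp for w, lp in zip(weights, logprobs))
    return -total / len(logprobs)


if __name__ == "__main__":
    scores = [0.0, 1.0]  # second token is strongly tied to the profile
    print(rapo_token_weights(scores, advantage=2.0))
    print(rapo_token_weights(scores, advantage=-2.0))
```

With this form, a token with a score of zero is treated exactly as in the unweighted objective, so the weighting only changes the update for tokens that depend on the profile.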

The authors consider this work important because it targets two failure modes: the poor out-of-distribution generalization of supervised fine-tuning, and the accumulation of reward hacking when an LLM serves as the reward model. Addressing both, they argue, leads to more faithful character portrayal.