This paper introduces SCPO, a novel reward model training algorithm that balances diverse cultural preferences across subcommunities. SCPO improves reward model performance for minority groups by up to 7 points across two datasets and seven countries, while being up to 280% more data-efficient in training than full-data fine-tuning. A targeted evaluation of subcommunity preferences shows reduced bias.
Steerable Cultural Preference Optimization of Reward Models