This study investigates the instability of persona-driven generations (PDGs) in large language models on multiple-choice question answering (MCQA) tasks, a setting that has received less attention than free-form text interactions. The authors develop three metrics that evaluate stability along distinct dimensions: performance, outcome, and question correctness.

  • Instability varies consistently across model families, model sizes, and question domains, with math and commonsense questions exhibiting greater instability.
  • Task prompt format introduces more prediction instability than other hyperparameters such as temperature.
  • Instability is related to task accuracy: different experimental settings can yield different best and worst personas, even when the settings are similar.

The findings highlight the importance of checking for instability across hyperparameter settings in persona-driven generations to ensure reliable performance.
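
One way such a check could look in practice is sketched below. This is a minimal illustration, not the authors' three metrics, which this summary does not define. The function names (`accuracy`, `flip_rate`, `best_persona`), the persona labels, the format labels, and the toy predictions are all hypothetical. The sketch asks two questions: how often a question's correctness changes when only the prompt format changes, and whether the best-scoring persona stays the same across formats.

```python
# Illustrative sketch only: these are NOT the paper's metrics.
# All names and the toy data below are hypothetical.

def accuracy(preds, gold):
    """Fraction of questions answered correctly."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)


def flip_rate(preds_by_setting, gold):
    """Fraction of questions whose correctness differs between settings.

    preds_by_setting maps a setting name (e.g. a prompt format) to the
    list of predicted options for one persona under that setting.
    """
    flips = 0
    for i, answer in enumerate(gold):
        # Collect the correctness outcomes for question i over all settings.
        outcomes = {preds[i] == answer for preds in preds_by_setting.values()}
        # Both True and False present means the question flipped.
        flips += len(outcomes) > 1
    return flips / len(gold)


def best_persona(preds_by_persona, gold):
    """Persona with the highest accuracy under a single setting."""
    return max(preds_by_persona,
               key=lambda name: accuracy(preds_by_persona[name], gold))


# Toy data: 4 questions, 2 personas, 2 prompt formats (all made up).
gold = ["A", "B", "C", "D"]
runs = {
    "format_1": {"teacher": ["A", "B", "C", "A"],
                 "pirate":  ["A", "B", "D", "A"]},
    "format_2": {"teacher": ["A", "C", "D", "D"],
                 "pirate":  ["A", "B", "C", "D"]},
}

# Regroup the teacher persona's predictions by prompt format.
teacher_by_format = {fmt: personas["teacher"] for fmt, personas in runs.items()}

# Best persona under each format; disagreement signals ranking instability.
best_by_format = {fmt: best_persona(personas, gold)
                  for fmt, personas in runs.items()}

if __name__ == "__main__":
    print("teacher flip rate:", flip_rate(teacher_by_format, gold))
    print("best persona per format:", best_by_format)
```

In this toy data the best persona differs between the two formats, which is the kind of disagreement the findings warn about. A real check would replace the hard-coded lists with model predictions collected under each setting.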