This paper systematically studies how stable prompt rankings are under common sources of variability, such as random seeds and limited evaluation subsets, across three open-weight LLMs and two benchmark tasks.

  • Overall rank correlations are often moderate to high, but the identity of the top-performing prompt frequently changes.
  • This instability makes it unreliable to select a single best prompt for downstream use.
  • The authors propose a stability-aware selection strategy based on a lower confidence bound that accounts for both performance and variance.
  • This approach improves robustness in unstable settings while remaining competitive in more stable regimes.
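
To make the lower-confidence-bound idea concrete, here is a minimal sketch in Python. The summary does not give the authors' exact formula, so the scoring rule used here (mean score minus `k` times the sample standard deviation across repeated runs) is an assumption. The function names, the prompt labels, and the scores are hypothetical and exist only for illustration.

```python
import statistics


def select_by_mean(scores_by_prompt):
    """Baseline: pick the prompt with the highest mean score across runs."""
    return max(scores_by_prompt, key=lambda p: statistics.fmean(scores_by_prompt[p]))


def lower_confidence_bound(scores, k=1.0):
    """Assumed LCB form: mean - k * sample standard deviation.

    The paper's actual bound may differ (for example, it might use the
    standard error or a bootstrap quantile instead).
    """
    mean = statistics.fmean(scores)
    # stdev needs at least two values; treat a single run as having zero spread.
    spread = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean - k * spread


def select_by_lcb(scores_by_prompt, k=1.0):
    """Stability-aware choice: pick the prompt with the highest LCB."""
    return max(
        scores_by_prompt,
        key=lambda p: lower_confidence_bound(scores_by_prompt[p], k),
    )


# Made-up accuracies over four seeds, for illustration only.
scores = {
    "prompt_a": [0.80, 0.60, 0.82, 0.58],  # slightly higher mean, high variance
    "prompt_b": [0.69, 0.68, 0.70, 0.69],  # slightly lower mean, very stable
}

print("best by mean:", select_by_mean(scores))
print("best by LCB: ", select_by_lcb(scores, k=1.0))
```

In this toy data the two rules disagree: the mean favors the noisier prompt, while the bound favors the stable one. Setting `k=0` reduces the rule to plain mean-based selection, so `k` controls how strongly variance is penalized.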

These findings highlight the importance of accounting for evaluation uncertainty in prompt selection and LLM benchmarking.