This paper systematically studies how stable prompt rankings remain under common sources of variability, such as random seeds and limited evaluation subsets, across three open-weight LLMs and two benchmark tasks.
- Overall rank correlations are often moderate to high, but the identity of the top-performing prompt frequently changes.
- This instability makes selecting a single best prompt for downstream use unreliable.
- The authors propose a stability-aware selection strategy based on a lower confidence bound that accounts for both performance and variance.
- This approach improves robustness in unstable settings while remaining competitive in more stable regimes.
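
The selection idea in the last two bullets can be illustrated with a minimal sketch. The summary does not give the exact form of the bound, so the version below assumes a simple one: the mean score across repeated runs minus `k` times the sample standard deviation. The function name `lcb_select`, the parameter `k`, the prompt names, and the scores are all hypothetical and chosen only for illustration; they are not results from the paper.

```python
from statistics import mean, stdev


def lcb_select(scores, k=1.0):
    """Return the prompt with the highest lower confidence bound.

    scores: dict mapping prompt name -> list of scores from repeated runs
            (e.g. different seeds or evaluation subsets).
    k:      assumed penalty weight on variability; larger k favors stability.
    """
    def lcb(runs):
        # Assumed bound: mean minus k sample standard deviations.
        return mean(runs) - k * stdev(runs)

    return max(scores, key=lambda prompt: lcb(scores[prompt]))


# Illustrative (made-up) accuracies over four runs per prompt.
scores = {
    "prompt_a": [0.82, 0.60, 0.80, 0.62],  # slightly higher mean, unstable
    "prompt_b": [0.70, 0.69, 0.71, 0.70],  # slightly lower mean, stable
}

# Naive selection looks only at the mean score.
best_by_mean = max(scores, key=lambda prompt: mean(scores[prompt]))

# Stability-aware selection penalizes run-to-run variance.
best_by_lcb = lcb_select(scores, k=1.0)

print(best_by_mean, best_by_lcb)
```

In this toy data the two criteria disagree: the mean favors the unstable prompt by a small margin, while the bound favors the stable one. With `k=0` the rule reduces to plain mean-based selection, which is one way such a strategy could stay competitive when rankings are already stable.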
These findings highlight the importance of accounting for evaluation uncertainty in prompt selection and LLM benchmarking.