On the Stability of Prompt Ranking in Large Language Model Evaluation

This paper systematically studies how stable prompt rankings are under common sources of variability, such as random seeds and limited evaluation subsets, across three open-weight LLMs and two benchmark tasks.

Overall rank correlations are often moderate to high, yet the identity of the top-performing prompt frequently changes from one seed or evaluation subset to the next.
This instability makes prompt selection for downstream use unreliable.
The authors propose a stability-aware selection strategy based on a lower confidence bound that accounts for both performance and variance.
This approach improves robustness in unstable settings while remaining competitive in more stable regimes.
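
As a rough illustration of the idea, the sketch below scores each prompt by a lower confidence bound instead of its mean alone. The summary does not give the authors' exact formula, so the form used here (mean minus k times the sample standard deviation across runs), the function name lcb_select, the parameter k, and the accuracy numbers are all assumptions made up for this example.

```python
from statistics import mean, stdev


def lcb_select(scores, k=1.0):
    """Pick the prompt with the highest lower confidence bound.

    scores: dict mapping a prompt name to a list of per-run scores
            (for example, accuracy under different seeds or subsets).
    k:      hypothetical penalty weight on run-to-run spread (assumed form).
    Returns (best_prompt, dict of LCB values per prompt).
    """
    lcb = {}
    for prompt, runs in scores.items():
        # Sample standard deviation needs at least two runs.
        spread = stdev(runs) if len(runs) > 1 else 0.0
        lcb[prompt] = mean(runs) - k * spread
    best = max(lcb, key=lcb.get)
    return best, lcb


# Made-up numbers: prompt_a has the higher mean but varies widely across runs;
# prompt_b is slightly lower on average but very consistent.
scores = {
    "prompt_a": [0.80, 0.60, 0.82, 0.58],
    "prompt_b": [0.69, 0.70, 0.68, 0.69],
}

best_by_mean = max(scores, key=lambda p: mean(scores[p]))
best_by_lcb, lcb = lcb_select(scores, k=1.0)

print("best by mean:", best_by_mean)
print("best by LCB: ", best_by_lcb)
```

With these illustrative numbers, selecting by mean favors the high-variance prompt, while the bound favors the consistent one. A larger k penalizes variance more heavily, and k = 0 reduces to ordinary mean-based selection.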

These findings highlight the importance of accounting for evaluation uncertainty in prompt selection and LLM benchmarking.