Stability of Prompt Ranking in LLM Evaluation

Prompt rankings in large language model (LLM) evaluation are often unstable under minor variations, such as a different random seed or evaluation on a small subset of the data. A stability-aware selection strategy based on lower confidence bounds improves robustness by accounting for both performance and variance, while remaining competitive in stable settings.
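
As a rough illustration of selecting by a lower confidence bound, the sketch below scores each prompt by its mean minus k sample standard deviations across repeated runs. The exact form of the bound, the multiplier k, the function names lcb_score and select_prompt, and the scores themselves are all assumptions made for this example and are not taken from the text above.

```python
import statistics


def lcb_score(scores, k=1.0):
    """Lower confidence bound, assumed here to be mean - k * sample std."""
    mean = statistics.fmean(scores)
    # A single run gives no variance estimate, so treat its spread as zero.
    spread = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean - k * spread


def select_prompt(runs, k=1.0):
    """Return the prompt name whose runs have the highest lower bound."""
    return max(runs, key=lambda name: lcb_score(runs[name], k))


# Made-up accuracies over four seeds or subsets, for illustration only.
runs = {
    "prompt_a": [0.90, 0.60, 0.85, 0.65],  # higher mean, high variance
    "prompt_b": [0.74, 0.72, 0.73, 0.75],  # slightly lower mean, stable
}

best_by_mean = max(runs, key=lambda name: statistics.fmean(runs[name]))
best_by_lcb = select_prompt(runs, k=1.0)
print(best_by_mean, best_by_lcb)
```

With these illustrative numbers, ranking by mean alone picks prompt_a, while the lower bound picks the more stable prompt_b. Setting k to 0 reduces the rule to ordinary mean ranking, which is one way to see why such a rule can stay competitive when variance is low.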