Prompt rankings in large language model (LLM) evaluation are often unstable under minor perturbations, such as a different random seed or a limited evaluation subset. A stability-aware selection strategy based on lower confidence bounds improves robustness by accounting for both mean performance and variance, while remaining competitive in stable settings.
Stability of Prompt Ranking in LLM Evaluation
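The selection idea in the summary can be illustrated with a minimal sketch. The summary does not give the exact form of the bound, so this assumes a common one: the mean score minus k standard errors across repeated runs. The function names, the parameter `k`, and the scores are hypothetical and chosen only for illustration; they are not results from the work.

```python
from math import sqrt
from statistics import mean, stdev


def lower_confidence_bound(scores, k=1.0):
    """Mean minus k standard errors (assumed form; needs at least two scores)."""
    return mean(scores) - k * stdev(scores) / sqrt(len(scores))


def select_prompt(runs, k=1.0):
    """Return the prompt whose scores have the highest lower confidence bound."""
    return max(runs, key=lambda name: lower_confidence_bound(runs[name], k))


# Toy scores per prompt across seeds or subsets (invented for illustration).
runs = {
    "prompt_a": [0.90, 0.50, 0.95, 0.45],  # slightly higher mean, high variance
    "prompt_b": [0.68, 0.70, 0.69, 0.69],  # slightly lower mean, stable
}

best_by_mean = max(runs, key=lambda name: mean(runs[name]))
best_by_lcb = select_prompt(runs, k=1.0)

print("best by mean:", best_by_mean)
print("best by lower confidence bound:", best_by_lcb)
```

In this toy case, ranking by mean alone favors the high-variance prompt, while the bound favors the stable one. A larger `k` penalizes variance more heavily; when all prompts have similar variance, the two criteria tend to agree.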