On the Stability of Prompt Ranking in Large Language Model Evaluation
This paper systematically studies how stable prompt rankings remain under common sources of variability, such as random seeds and limited evaluation subsets, across three open-weight LLMs and two benchmark tasks.