RECOM evaluates 15,000 r/AskReddit questions paired with authentic community replies posted after model training. It shows that no automatic metric achieves both strong validity and strong discriminative power: BERTScore, for instance, ranks models only weakly even when length is controlled. The tradeoff arises from representation design rather than from differences between models, so results should report both validity and discrimination alongside random-baseline floors.
RECOM: Validity-Discrimination Tradeoff in Reddit QA Metrics