A study compares seven confidence-scoring methods across 25 model-dataset pairs. It finds that single-shot verbalized confidence ranks cases well but takes only a few distinct values, which limits the thresholds an operator can choose from. Multi-query aggregation increases score granularity, but its effect depends on the model: it improves weak models and degrades strong ones, a trade-off that matters for practical deployment.
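
The granularity point can be illustrated with a minimal sketch. It assumes mean aggregation over repeated queries, which is only one possible aggregation rule and is not necessarily one of the seven methods in the study. All scores are invented for illustration, and the helper names `distinct_operating_points` and `aggregate_mean` are hypothetical.

```python
from statistics import mean


def distinct_operating_points(scores):
    """Count distinct score values, i.e. the threshold levels that
    actually separate cases (rounded to absorb float noise)."""
    return len({round(s, 6) for s in scores})


def aggregate_mean(samples_per_case):
    """Average several verbalized confidences per case.
    Assumption: mean aggregation; other rules are possible."""
    return [mean(samples) for samples in samples_per_case]


# Hypothetical single-shot verbalized confidences for six cases.
single_shot = [0.9, 0.9, 0.8, 0.9, 0.8, 0.9]

# Hypothetical confidences from three queries per case.
multi_query = [
    [0.9, 0.8, 0.9],
    [0.9, 0.9, 0.9],
    [0.8, 0.7, 0.8],
    [0.9, 0.9, 0.8],
    [0.8, 0.8, 0.9],
    [0.9, 0.7, 0.8],
]

aggregated = aggregate_mean(multi_query)

print("single-shot levels:", distinct_operating_points(single_shot))
print("aggregated levels: ", distinct_operating_points(aggregated))
```

With these made-up inputs the aggregated scores take more distinct values than the single-shot scores, so an operator has more thresholds to choose from. The sketch says nothing about ranking quality, which according to the study can move in either direction depending on model strength.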