ParaPairAudioBench: Benchmark for Paralinguistic Speech Evaluation
ParaPairAudioBench is a pairwise benchmark of 5,175 audio pairs spanning five paralinguistic dimensions. It shows that current large audio-language model (LALM) judges trail human judgments by 32% on average and are poorly calibrated, especially on tie cases, where abstaining is the correct response.
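To make the tie-case point concrete, here is a minimal sketch of how judge-versus-human agreement on a pairwise benchmark could be split into tie and non-tie pairs. It is an illustration under stated assumptions, not the benchmark's actual scoring code: the label vocabulary ("A", "B", "tie"), the function name `agreement_by_case`, and the use of plain agreement rate as the metric are all hypothetical.

```python
from collections import Counter


def agreement_by_case(human, judge):
    """Return the judge-human agreement rate overall, on tie pairs, and on non-tie pairs.

    Both arguments are equal-length sequences of labels. The labels
    "A", "B", and "tie" are an assumed encoding, not the benchmark's own.
    """
    if len(human) != len(judge):
        raise ValueError("human and judge must have the same length")

    hits = Counter()
    totals = Counter()
    for h, j in zip(human, judge):
        # A pair counts as a tie case when the human label is "tie",
        # i.e. when abstaining is the correct response for the judge.
        case = "tie" if h == "tie" else "non_tie"
        for key in (case, "overall"):
            totals[key] += 1
            if h == j:
                hits[key] += 1

    # Only report cases that actually occur, which avoids division by zero.
    return {key: hits[key] / totals[key] for key in totals}


# Toy labels, invented purely for illustration.
human_labels = ["A", "B", "tie", "tie", "A"]
judge_labels = ["A", "A", "B", "tie", "A"]
scores = agreement_by_case(human_labels, judge_labels)
```

Reporting the tie subset separately matters because a judge that never abstains can still score well on non-tie pairs while getting every tie pair wrong, and a single aggregate number would hide that.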