A new pipeline lets clinicians perform remote annotation and blinded evaluation of ultrasound AI models without downloading data locally. It supports multi-rater participation, result aggregation, and automated statistical analysis, and it was validated in a fetal ultrasound segmentation study with six raters of varying expertise. The results show moderate to strong agreement, and in the blinded rankings raters preferred models from later active learning rounds.
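
To make the aggregation and agreement step concrete, here is a minimal Python sketch of how blinded rankings from several raters might be combined. It is an illustration under stated assumptions, not the pipeline's actual analysis: the summary does not name the statistics used, so Kendall's coefficient of concordance (W) is swapped in as one common choice for ranked data. The function names, rater labels, model labels, and rank values are all hypothetical and are not results from the study.

```python
from statistics import mean


def mean_ranks(rankings):
    """Average each model's rank across raters (1 = best).

    rankings: {rater_id: {model_id: rank}}; every rater ranks every model.
    """
    models = sorted(next(iter(rankings.values())))
    return {m: mean(r[m] for r in rankings.values()) for m in models}


def kendalls_w(rankings):
    """Kendall's W for complete rankings without ties.

    0 means no agreement among raters, 1 means identical rankings.
    """
    raters = list(rankings.values())
    m = len(raters)                      # number of raters
    models = sorted(raters[0])
    n = len(models)                      # number of ranked models
    rank_sums = [sum(r[k] for r in raters) for k in models]
    expected = m * (n + 1) / 2           # rank sum if raters agreed at random
    s = sum((x - expected) ** 2 for x in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Hypothetical blinded rankings: model identities are hidden from raters
# during scoring and mapped back to labels only at analysis time.
example = {
    "rater_a": {"round_1": 3, "round_2": 2, "round_3": 1},
    "rater_b": {"round_1": 3, "round_2": 1, "round_3": 2},
    "rater_c": {"round_1": 3, "round_2": 2, "round_3": 1},
}

print(mean_ranks(example))
print(round(kendalls_w(example), 3))
```

A lower mean rank indicates a more preferred model. Real data with tied or missing ranks would need a tie-corrected statistic or a different agreement measure.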