Majority Vote Silences Minority Values: Annotator Disagreement at the Hate/Offensive Boundary in HateXplain

The study demonstrates that collapsing annotator disagreement into majority-vote labels is not a neutral step in hate speech annotation: 42.6% of all annotator disagreement in HateXplain concentrates at the hate/offensive boundary. This pattern indicates that annotators apply different thresholds for where hate begins, which makes the definition of ground truth itself a structural problem.
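
To make the aggregation step concrete, the following is a minimal sketch rather than the authors' code. It assumes each post carries a small list of annotator votes over the labels hate, offensive, and normal; the function names `majority_vote`, `soft_label`, and `disagreement_boundary` are hypothetical and introduced only for illustration.

```python
from collections import Counter

# Assumed label set for illustration.
LABELS = ("hate", "offensive", "normal")


def majority_vote(votes):
    """Return the label chosen by a strict majority, or None if there is none."""
    top, count = Counter(votes).most_common(1)[0]
    return top if count > len(votes) / 2 else None


def soft_label(votes):
    """Return the empirical distribution of votes over LABELS."""
    counts = Counter(votes)
    return {label: counts[label] / len(votes) for label in LABELS}


def disagreement_boundary(votes):
    """Name the boundary the annotators split on, or None if they all agree."""
    distinct = sorted(set(votes))
    if len(distinct) == 1:
        return None
    return "/".join(distinct)


# Toy example (invented votes, not taken from HateXplain).
votes = ["hate", "hate", "offensive"]
print(majority_vote(votes))          # hate
print(soft_label(votes))
print(disagreement_boundary(votes))  # hate/offensive
```

In this toy case the majority label is "hate", and the dissenting "offensive" vote survives only in the soft label. That dissent is the information a majority-vote pipeline discards.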

42.6% of annotator disagreement in HateXplain occurs at the hate/offensive boundary (chi-squared = 135.199, df = 2, p < 0.0001).
Both the hard-label BERT model (Model A) and the soft-label models lose 22 percentage points of accuracy between posts where annotators agree (~80%) and posts where they disagree (~58%).
A per-annotator multi-head model (Model C) widens this accuracy gap to 28 points, and its accuracy on offensive disagreement posts collapses to 0.245.
On boundary-case errors, Model A expresses significantly higher confidence than Model C (0.710 vs. 0.495), so standard evaluation metrics fail to detect the failure.
Three downstream interventions of increasing sophistication all fail to recover boundary accuracy.
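
Two of the quantities in the findings above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' evaluation code. The first function uses the closed-form upper tail of the chi-squared distribution with 2 degrees of freedom, exp(-x/2), to check that the reported statistic is consistent with p < 0.0001. The second reports accuracy and confidence on errors separately for agreed and disagreement posts. The record keys `gold`, `pred`, `confidence`, and `agreed` are hypothetical, `gold` is assumed to be the majority-vote label, and the toy records are invented.

```python
import math


def chi2_p_value_df2(statistic):
    """Upper-tail p-value for a chi-squared statistic with df = 2.

    For exactly 2 degrees of freedom the survival function is exp(-x / 2).
    """
    return math.exp(-statistic / 2.0)


def stratified_report(records):
    """Accuracy and mean confidence on errors, split by annotator agreement.

    Each record is a dict with hypothetical keys:
    gold (str), pred (str), confidence (float), agreed (bool).
    """
    groups = {"agreed": [], "disagreement": []}
    for r in records:
        groups["agreed" if r["agreed"] else "disagreement"].append(r)

    report = {}
    for name, rows in groups.items():
        errors = [r for r in rows if r["pred"] != r["gold"]]
        report[name] = {
            "n": len(rows),
            "accuracy": (len(rows) - len(errors)) / len(rows) if rows else None,
            "mean_confidence_on_errors": (
                sum(r["confidence"] for r in errors) / len(errors)
                if errors
                else None
            ),
        }
    return report


# Statistic reported in the findings above.
print(chi2_p_value_df2(135.199) < 0.0001)  # True

# Invented toy predictions, for illustration only.
toy = [
    {"gold": "hate", "pred": "hate", "confidence": 0.9, "agreed": True},
    {"gold": "normal", "pred": "normal", "confidence": 0.8, "agreed": True},
    {"gold": "hate", "pred": "offensive", "confidence": 0.7, "agreed": False},
    {"gold": "offensive", "pred": "offensive", "confidence": 0.6, "agreed": False},
]
print(stratified_report(toy))
```

Reporting the two subsets side by side is what exposes the gap: a single pooled accuracy figure would average the agreed and disagreement posts together and hide both the drop and the confident errors.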

The authors argue that majority vote presents a contested judgment as ground truth, so models trained on it inherit false certainty. They conclude that the intervention must happen upstream, in annotation design, rather than after label aggregation.