The study shows that collapsing annotator disagreement into majority-vote labels during hate speech annotation is not a neutral step: 42.6% of all disagreement falls at the hate/offensive boundary. This concentration indicates that annotators apply different thresholds for where hate begins, which makes the problem structural to how ground truth is defined.
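
To make the aggregation step concrete, the sketch below contrasts a majority-vote label with a soft label (the distribution of votes) for a single post, and identifies which boundary a two-way split falls on. It is an illustration only, not the authors' code: the label names, the three-annotator example, and the helper names `majority_label`, `soft_label`, and `disagreement_boundary` are assumptions introduced here.

```python
from collections import Counter

# Assumed three-class label set; the exact label strings are hypothetical.
LABELS = ("hate", "offensive", "normal")


def majority_label(votes):
    """Return the most frequent label, or None when the top two are tied."""
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


def soft_label(votes):
    """Return the share of votes per label, keeping the disagreement visible."""
    counts = Counter(votes)
    total = len(votes)
    return {label: counts[label] / total for label in LABELS}


def disagreement_boundary(votes):
    """Return the sorted label pair for a two-way split, else None."""
    distinct = sorted(set(votes))
    return tuple(distinct) if len(distinct) == 2 else None


# Hypothetical post on which two annotators chose "hate" and one "offensive".
votes = ["hate", "hate", "offensive"]
print(majority_label(votes))         # hate
print(soft_label(votes))             # hate gets 2/3, offensive 1/3, normal 0
print(disagreement_boundary(votes))  # ('hate', 'offensive')
```

The majority label reports only "hate", while the soft label and the boundary pair record that one annotator placed the same post on the other side of the hate/offensive threshold.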

  • 42.6% of annotator disagreement in HateXplain occurs at the hate/offensive boundary (chi-squared = 135.199, df = 2, p < 0.0001).
  • Hard-label BERT (Model A) and the soft-label models both lose about 22 percentage points of accuracy between posts where annotators agreed (~80%) and posts where they disagreed (~58%).
  • A per-annotator multi-head model (Model C) widens the accuracy gap to 28 points, with accuracy on offensive-class disagreement posts collapsing to 0.245.
  • Model A is significantly more confident in its boundary-case errors than Model C (0.710 vs. 0.495), so standard evaluation metrics do not surface this failure.
  • Three downstream interventions of increasing sophistication all fail to recover boundary accuracy.
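
The two quantities behind these findings, the accuracy gap between agreed and disagreement posts and the mean confidence on errors, can be computed as in the minimal sketch below. The record layout (`pred`, `gold`, `confidence`, `agreed`) and the four toy rows are hypothetical and chosen only to make the arithmetic easy to follow; they are not data or results from the study.

```python
def accuracy(rows):
    """Fraction of rows whose prediction matches the reference label."""
    return sum(r["pred"] == r["gold"] for r in rows) / len(rows)


def agreement_gap(rows):
    """Accuracy on agreed posts minus accuracy on disagreement posts."""
    agreed = [r for r in rows if r["agreed"]]
    disputed = [r for r in rows if not r["agreed"]]
    return accuracy(agreed) - accuracy(disputed)


def mean_error_confidence(rows):
    """Average model confidence over misclassified rows (None if no errors)."""
    errors = [r["confidence"] for r in rows if r["pred"] != r["gold"]]
    return sum(errors) / len(errors) if errors else None


# Toy, made-up predictions: two agreed posts and two disagreement posts.
rows = [
    {"pred": "hate", "gold": "hate", "confidence": 0.5, "agreed": True},
    {"pred": "normal", "gold": "normal", "confidence": 0.5, "agreed": True},
    {"pred": "offensive", "gold": "offensive", "confidence": 0.5, "agreed": False},
    {"pred": "hate", "gold": "offensive", "confidence": 0.75, "agreed": False},
]

print(agreement_gap(rows))          # 0.5  (1.0 on agreed, 0.5 on disputed)
print(mean_error_confidence(rows))  # 0.75 (one error, made at confidence 0.75)
```

Reporting a single overall accuracy would hide both numbers, which is why the split by annotator agreement matters for evaluation.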

The authors argue that majority vote presents a contested judgment as ground truth, so models trained on it inherit false certainty. They conclude that the necessary intervention must come upstream, in annotation design, rather than after label aggregation.