Introducing corpora Hlava Cor and Hlava AD: Human Label Variation in Coreference and Discourse Relations
Researchers have created two new corpora, Hlava Cor and Hlava AD, to study how human readers vary in their understanding of text coherence. Both resources contain multiple parallel annotations of Czech texts, together with the annotators' explanations of their choices. The first corpus, Hlava Cor, consists of 1,024 contexts, each annotated by three annotators, and captures differences in how coreference is identified. It covers pronouns, full noun phrases, and anaphoric adverbials across a range of text types and grammatical-semantic categories. The second corpus, Hlava AD, comprises 512 contexts annotated by five annotators and focuses on discourse relations in attributive and non-attributive constructions. Inter-annotator agreement in both corpora is approximately 60-65 percent. Analysis shows that contexts with lower coreference agreement among annotators are also those where automatic models disagree more, which suggests that these contexts are more ambiguous. The annotators' comments further reveal varying levels of confidence and individual reading strategies.
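The summary does not say which agreement measure underlies the 60-65 percent figure. As a minimal sketch, assuming simple mean pairwise observed agreement (the share of annotator pairs that chose the same label, averaged over all contexts), the computation could look like the following. The function name `pairwise_agreement` and the toy labels are hypothetical illustrations and are not taken from either corpus.

```python
from itertools import combinations


def pairwise_agreement(labels_per_context):
    """Return the fraction of annotator pairs that chose the same label.

    labels_per_context: list of lists, one inner list per context,
    holding one label per annotator (assumed data layout).
    """
    agreeing_pairs = 0
    total_pairs = 0
    for labels in labels_per_context:
        # Compare every unordered pair of annotators on this context.
        for first, second in combinations(labels, 2):
            agreeing_pairs += int(first == second)
            total_pairs += 1
    if total_pairs == 0:
        raise ValueError("need at least one context with two or more labels")
    return agreeing_pairs / total_pairs


# Invented toy data: three contexts, three annotators each.
toy_contexts = [
    ["Petr", "Petr", "Petr"],       # all 3 pairs agree
    ["Petr", "Pavel", "Petr"],      # 1 of 3 pairs agrees
    ["Praha", "Brno", "no-link"],   # no pair agrees
]

score = pairwise_agreement(toy_contexts)
print(f"mean pairwise agreement: {score:.3f}")
```

Raw pairwise agreement does not correct for chance; a chance-corrected coefficient would be a different measure and would generally yield a different number on the same data.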