License Compatibility Analysis of Corpora for Low-Resource African Languages

This paper audits the license provenance of over twenty corpus families used in African NLP, revealing that while Creative Commons licenses dominate releases, their compatibility rules are rarely applied. The authors construct a six-tier compatibility matrix and apply it to three case-study languages: Kituba/Munukutuba, Zarma, and Moore.

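Among the key findings, CC-BY-SA and CC-BY-NC material cannot be combined in a single published dataset, because ShareAlike requires the combined work to carry the same license and that license cannot also impose the NonCommercial restriction. NoDerivs clauses silently prohibit releasing tokenised or annotated versions of a corpus.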
Four failure modes are documented with primary-source evidence: outright prohibition (JW300), composite license misrepresentation (WAXAL), a NoDerivs clause hidden behind a CC-BY label (Tanzil), and data persistence failure (Congolese Radio Corpus).
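
The pairwise rules mentioned above (NoDerivs blocks derived releases; ShareAlike without NonCommercial cannot absorb NonCommercial material) can be checked mechanically. The following is a minimal sketch, not the paper's six-tier matrix: the function names `license_elements` and `can_remix` are hypothetical, license versions and CC0 are ignored, and the rules follow a simplified reading of the Creative Commons remix compatibility chart.

```python
# Illustrative sketch only: a simplified pairwise remix check for CC licenses.
# Assumption: licenses are given as labels like "CC-BY-NC-SA"; versions are ignored.

KNOWN_ELEMENTS = {"BY", "SA", "NC", "ND"}


def license_elements(name: str) -> frozenset[str]:
    """Split a label such as 'CC-BY-NC-SA' into its elements {'BY', 'NC', 'SA'}."""
    parts = name.upper().split("-")
    return frozenset(p for p in parts if p in KNOWN_ELEMENTS)


def can_remix(a: str, b: str) -> bool:
    """Return True if works under licenses a and b may be merged into one derived release."""
    ea, eb = license_elements(a), license_elements(b)

    # NoDerivs material may not be shared in adapted form at all.
    if "ND" in ea or "ND" in eb:
        return False

    # A ShareAlike work without NonCommercial forces its own license on the
    # result, which cannot also honour the other work's NonCommercial term.
    for sa_side, other in ((ea, eb), (eb, ea)):
        if "SA" in sa_side and "NC" not in sa_side and "NC" in other:
            return False

    return True


if __name__ == "__main__":
    for pair in [("CC-BY-SA", "CC-BY-NC"), ("CC-BY", "CC-BY-SA"), ("CC-BY-ND", "CC-BY")]:
        print(pair, can_remix(*pair))
```

A check like this only flags obvious conflicts between declared labels; it cannot detect the mislabelling cases described above, where the declared license differs from the actual terms.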

The study also provides a pre-annotation due diligence checklist and surveys legally clean enrichment opportunities that address these licensing and data-availability problems.