Research Associate, University of Pretoria | Co-Founder, HausaNLP & ArewaDS | Member, MasaKhaneNLP
Our 2026 publications so far and where African NLP is heading
Publicly available work I’ve been part of this year, spanning translation, sentiment, speech, language identification, and benchmarking for African languages:
AfriScience-MT: Towards Decolonizing Science in Africa through Text Translation. Idris Abdulmumin et al. (2026). Parallel corpus across six African languages, confronting the terminology gap that blocks native-language access to science. https://lnkd.in/dUPPbHck
Temporal Simultaneity Predicts Annotation Quality in Sentiment Corpora. Idris Abdulmumin et al. (2026). Setswana sentiment dataset analyzing how inter-annotator agreement decays over long annotation campaigns. https://lnkd.in/dnADuScY
Beyond Majority Voting: Agreement-Based Clustering to Model Annotator Perspectives in Subjective NLP Tasks. Tadesse Destaw Belay et al. (2026). Scalable alternative to majority voting: cluster annotators by agreement to preserve diverse perspectives. https://lnkd.in/dFPQQnaR
NaijaS2ST: A Multi-Accent Benchmark for Speech-to-Speech Translation in Low-Resource Nigerian Languages. Marie Maltais et al. (2026). Parallel speech translation dataset for Igbo, Hausa, Yoruba, and Nigerian Pidgin paired with English. https://lnkd.in/dYeWemMT
SemEval-2026 Task 3: Dimensional Aspect-Based Sentiment Analysis (DimABSA). Liang-Chih Yu et al. (2026). Shared task moving ABSA from categorical polarity to valence-arousal modeling, extended to public-issue discourse. https://lnkd.in/dk7KwCAq
SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization. Usman Naseem et al. (2026). 22 languages and 110K instances for detecting online polarization, its type, and its manifestation. https://lnkd.in/dzM7KvSM
DimStance: Multilingual Datasets for Dimensional Stance Analysis. Jonas Becker et al. (2026). Stance datasets modeling attitudes along continuous valence-arousal dimensions instead of Favor/Neutral/Against bins. https://lnkd.in/d8MVBBm2
CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data. Pedro Ortiz Suarez et al. (2026). Community-built LID benchmark across 109 languages for the noisy web domain, where current LID still fails. https://lnkd.in/dBgW6b6P
Afri-MCQA: Multimodal Cultural Question Answering for African Languages. Atnafu Lambebo Tonja et al. (2026). First multilingual cultural QA benchmark for 15 African languages: 7.5K parallel pairs across text and speech, built by native speakers. https://lnkd.in/dd-SsvKn
Swivuriso: The South African Next Voices Multilingual Speech Dataset. Vukosi Marivate et al. (2026). 3,000-hour ASR dataset covering seven South African languages across agriculture, healthcare, and general domains. https://lnkd.in/dMcxXDXR
Gratitude to every co-author, annotator, and native-speaker reviewer who makes this work possible. More in the pipeline.
#AfricanNLP #LowResourceNLP #ResearchHighlights