Research Associate, University of Pretoria | Co-Founder, HausaNLP & ArewaDS | Member, MasaKhaneNLP
Our 2026 publications so far and where African NLP is heading

Publicly available work I’ve been part of this year in translation, sentiment, speech, language ID, and benchmarking for African languages:

AfriScience-MT: Towards Decolonizing Science in Africa through Text Translation. Idris Abdulmumin et al. (2026): Parallel corpus across six African languages, confronting the terminology gap that blocks native-language access to science. https://lnkd.in/dUPPbHck

Temporal Simultaneity Predicts Annotation Quality in Sentiment Corpora. Idris Abdulmumin et al. (2026): Setswana sentiment dataset, with an analysis of how inter-annotator agreement decays over long annotation campaigns. https://lnkd.in/dnADuScY

Beyond Majority Voting: Agreement-Based Clustering to Model Annotator Perspectives in Subjective NLP Tasks. Tadesse Destaw Belay et al. (2026): Scalable alternative to majority voting that clusters annotators by agreement to preserve diverse perspectives. https://lnkd.in/dFPQQnaR

NaijaS2ST: A Multi-Accent Benchmark for Speech-to-Speech Translation in Low-Resource Nigerian Languages. Marie Maltais et al. (2026): Parallel speech translation dataset for Igbo, Hausa, Yoruba, and Nigerian Pidgin paired with English. https://lnkd.in/dYeWemMT

SemEval-2026 Task 3: Dimensional Aspect-Based Sentiment Analysis (DimABSA). Liang-Chih Yu et al. (2026): Shared task moving ABSA from categorical polarity to valence-arousal modeling, extended to public-issue discourse. https://lnkd.in/dk7KwCAq

SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization. Usman Naseem et al. (2026): 22 languages and 110K instances for detecting online polarization, its type, and its manifestation. https://lnkd.in/dzM7KvSM

DimStance: Multilingual Datasets for Dimensional Stance Analysis. Jonas Becker et al. (2026): Stance datasets modeling attitudes along continuous valence-arousal dimensions instead of Favor/Neutral/Against bins. https://lnkd.in/d8MVBBm2

CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data. Pedro Ortiz Suarez et al. (2026): Community-built LID benchmark across 109 languages for the noisy web domain, where current LID still fails. https://lnkd.in/dBgW6b6P

Afri-MCQA: Multimodal Cultural Question Answering for African Languages. Atnafu Lambebo Tonja et al. (2026): First multilingual cultural QA benchmark for 15 African languages, with 7.5K parallel pairs across text and speech, built by native speakers. https://lnkd.in/dd-SsvKn

Swivuriso: The South African Next Voices Multilingual Speech Dataset. Vukosi Marivate et al. (2026): 3,000-hour ASR dataset covering seven South African languages across agriculture, healthcare, and general domains. https://lnkd.in/dMcxXDXR

Gratitude to every co-author, annotator, and native-speaker reviewer who makes this work possible. More in the pipeline.