DIRISA 2025 Annual National Research Data Workshop

Name: DIRISA 2025 Annual National Research Data Workshop
Start: 2025-07-02T07:30:00+02:00
End: 2025-07-03T17:00:00+02:00
Location: CSIR ICC

Africa/Johannesburg timezone

Provisional programme now available.

Developing a Word Sense Disambiguation Dataset for Sesotho sa Leboa: Manual Annotation and Authentic Text Sources

Not scheduled

Duration: 20 minutes

ICC (CSIR ICC)

Talk

Speaker: Mr Hlaudi Daniel Masethe (Tshwane University of Technology)

Word Sense Disambiguation (WSD) plays a critical role in Natural Language Processing (NLP), particularly for low-resource languages such as Sesotho sa Leboa (Northern Sotho), which lack comprehensive linguistic resources. This study presents a manually annotated WSD dataset for Sesotho sa Leboa that addresses the scarcity of such corpora. The dataset was built from text collected from authentic, standardized sources: academic dissertations, research papers, dictionaries, and curated web pages that feature ambiguous lexical items. These sources were chosen deliberately for their formal use of language, so that word meanings are represented reliably. Data were collected through web scraping, using the Beautiful Soup library in Python to parse HTML documents, extract the relevant content systematically, and remove extraneous HTML tags. Native language experts then annotated the text manually, assigning word-sense labels based on contextual usage. The dataset provides a foundational resource for training and evaluating WSD models in Sesotho sa Leboa and supports the development of more accurate, context-aware NLP tools for underrepresented languages.
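The abstract does not show the authors' scraping script, so the following is only a minimal sketch of the kind of step it describes: parsing an HTML document with Beautiful Soup and keeping the running text while discarding the tags. The sample markup, the `SAMPLE_HTML` constant, and the `extract_paragraphs` helper are hypothetical names introduced for illustration. The choice to keep only `<p>` elements is an assumption, and fetching pages from the web is left out.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page; the markup and wording are illustrative only
# and do not come from the dataset described in the abstract.
SAMPLE_HTML = """
<html>
  <head><title>Sample page</title><style>p { color: black; }</style></head>
  <body>
    <h1>Heading</h1>
    <p>First <b>paragraph</b> of text.</p>
    <script>console.log("ignored");</script>
    <p>Second paragraph.</p>
  </body>
</html>
"""


def extract_paragraphs(html: str) -> list[str]:
    """Return the visible text of each <p> element, with tags removed."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop elements whose content is never visible running text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    paragraphs = []
    for p in soup.find_all("p"):
        # get_text() joins the text nodes; split/join normalizes whitespace.
        text = " ".join(p.get_text().split())
        if text:  # skip paragraphs that contain only whitespace
            paragraphs.append(text)
    return paragraphs


if __name__ == "__main__":
    for line in extract_paragraphs(SAMPLE_HTML):
        print(line)
```

Returning one string per paragraph keeps each candidate context separate, which may be convenient if annotators later label an ambiguous word within its surrounding sentence. A real pipeline would probably need site-specific selectors and further cleaning.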

Primary author: Ms Mosima Anna Masethe

Co-author: Mr Hlaudi Daniel Masethe (Tshwane University of Technology)

There are no materials yet.
