DIRISA 2024 Annual National Research Data Workshop

Name: DIRISA 2024 Annual National Research Data Workshop
Start: 2024-07-02T07:00:00+02:00
End: 2024-07-03T17:00:00+02:00
Location: CSIR ICC

2 Jul 2024, 07:00 → 3 Jul 2024, 17:00 Africa/Johannesburg

ICC-G-Ruby - Ruby Auditorium (CSIR ICC)

Description

The 6th DIRISA Annual National Research Data Workshop

We are delighted to invite you to the 7th DIRISA Annual National Research Data Workshop, an esteemed gathering that convenes experts and enthusiasts from various domains pertinent to research data. The workshop aims to propel the progress of data-intensive research and foster an enhanced comprehension of research data management, with a keen focus on augmenting the quality of life in South Africa and the broader African context.

In an era characterized by digital ubiquity and interconnectivity, the significance of this workshop cannot be overstated. It serves as a cornerstone for nurturing a data-literate society, facilitating invaluable networking opportunities, and fostering collaborative connections among like-minded individuals. By fostering the exchange of innovative ideas and best practices, participants are empowered to extrapolate insights applicable across diverse domains.

Theme: "Research data repositories and services for the future"

Topics: Data Management
Big data and machine learning
Open data and security

Event Details:

Venue: CSIR International Convention Centre
Format: In-person event with a structured program at the ICC

Programme:

PDF of programme and schedule.

Registration:

Online registration will remain open until midnight 3 July 2024 but only card payments are now available via PayFast.

Subsequently, onsite registration will be available at the venue, subject to full fees.

Fees:

Early Bird Registration: R200.00
Late Registration: R500.00

We eagerly anticipate your participation in this enriching event as we collectively strive towards harnessing the transformative potential of research data for the betterment of society. Should you have any inquiries or require further information, please do not hesitate to reach out.

Contact

XNkosi@csir.co.za

helpdesk@chpc.ac.za

Tuesday, 2 July
- 07:00 → 08:00
  
  Arrival, Registration and Breakfast 1h
- 08:00 → 10:10
  Session: Opening
  
  Convener: Mrs Ina Smith (Academy of Science of South Africa (ASSAf))
  - 08:00
    
    Welcome 5m
    
    Welcome by Programme Director, Mrs Ina Smith
    
    Speaker: Mrs Ina Smith (Academy of Science of South Africa (ASSAf))
  - 08:05
    
    Opening 15m
    
    Opening by NGEI Executive Manager, Dr Lulama Wakaba
    
    Speaker: Dr Lulama Wakaba (NGEI Executive Manager)
  - 08:20
    
    Update on South African Cyberinfrastructure Initiatives 15m
    
    Update on South African Cyberinfrastructure Initiatives by Dr Happy Sithole, NICIS Centre Manager
  - 08:35
    
    How do we measure the impact of big data in society? 15m
    
    How do we measure the impact of big data in society? by DSI Deputy Director, Cyber Infrastructure, Mr Daniel Mokhohlane
    
    Speaker: Mr Daniel Manama Mokhohlane (DSI)
  - 08:50
    
    DOCiD (Digital Object Container iD) by the Africa PID Alliance 30m
    
    The Africa PID Alliance’s overall objective is to produce DOIs in Africa through its open infrastructure and the integration of different identifiers, in order to disseminate indigenous knowledge, cultural heritage and patent data.
    
    The DOI being produced by the Africa PID Alliance is called the Digital Object Container Identifier (DOCiD™). It is multilinear in nature: it can accommodate other identifier types and connect to the object to which it is assigned.
    
    To achieve this goal, the Africa PID Alliance intends to be a global open infrastructure provider, and partnership conversations have begun with the DOI Foundation membership.
    
    At the DIRISA 2024 Annual National Research Data Workshop, we intend to present the latest updates on DOCiD and to seek further collaboration and partnerships to build on what we have achieved so far.
    
    Speaker: Mr Nabil Ksibi (Africa PID Alliance)
  - 09:20
    
    The prediction of the South African elections’ outcome 50m
    
    Abstract
    The CSIR used its highly acclaimed elections prediction model to predict the outcome of the 2024 elections held on 29 May 2024. This model was initially developed for South African elections and was first introduced during the 1999 general elections. Since its inception, the model has successfully forecasted the outcomes of local government elections with just 5% to 10% of voting districts (VDs) declared. It has garnered international recognition for its accuracy across various electoral systems. The tool also plays an integral role in authenticating the final results of the elections.
    The model relies on two main principles of understanding voting behaviour: past voting patterns and influences from political, socioeconomic and demographic factors. Using this data, CSIR statisticians and data scientists group voters into clusters, anticipating that changes in voting behaviour will be similar within each group. When the early results arrive, the model uses this data to estimate new voting behaviour for similar groups of districts. These estimates are then extended to the remaining districts yet to be counted. By combining known results with predicted ones, the model then generates a final prediction. The model correctly predicted that the ANC would lose their majority, even though they would remain the leading party in the country and predicted that the MK party would overtake the EFF to be the 3rd largest party in the country. The predictions for the MK party were surprisingly good, considering they were a new party, and the model predicted their support to be close to 14% of the votes at a time in the counting process when the scoreboard was only showing their support to be around 8%. Overall, predictions for the top 6 parties performed well from around 10% of the VDs declared, with the ANC prediction taking longer to stabilize than the other parties but remaining within the 2% error margin once predictions were released.
    
    Speaker: Dr Paul Mokilane
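
The cluster-and-extrapolate idea described in the abstract can be illustrated with a minimal sketch. It is not the CSIR model: the cluster labels, the simple additive swing per cluster, the field names and all numbers below are assumptions made purely for illustration.

```python
from collections import defaultdict


def predict_share(districts):
    """Estimate one party's overall vote share from partially declared districts.

    Each district is a dict with hypothetical keys:
      cluster    - label of a group of districts expected to behave alike
      voters     - expected number of valid votes
      past_share - the party's share in the previous election
      new_share  - the declared share, or None if not yet counted
    """
    # Average change in share (swing) per cluster, using declared districts only.
    swings = defaultdict(list)
    for d in districts:
        if d["new_share"] is not None:
            swings[d["cluster"]].append(d["new_share"] - d["past_share"])
    mean_swing = {c: sum(v) / len(v) for c, v in swings.items()}

    total_votes = 0.0
    party_votes = 0.0
    for d in districts:
        if d["new_share"] is not None:
            share = d["new_share"]  # known result
        else:
            # Undeclared district: apply its cluster's swing (zero if none declared yet).
            share = d["past_share"] + mean_swing.get(d["cluster"], 0.0)
            share = min(max(share, 0.0), 1.0)
        total_votes += d["voters"]
        party_votes += share * d["voters"]
    return party_votes / total_votes


districts = [
    {"cluster": "urban", "voters": 1000, "past_share": 0.50, "new_share": 0.40},
    {"cluster": "urban", "voters": 3000, "past_share": 0.60, "new_share": None},
    {"cluster": "rural", "voters": 2000, "past_share": 0.70, "new_share": 0.65},
    {"cluster": "rural", "voters": 2000, "past_share": 0.80, "new_share": None},
]
print(round(predict_share(districts), 4))  # about 0.5875
```

In this toy example only half of the districts are declared, yet the estimate already reflects the swing observed in each cluster rather than the raw running total.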
- 10:10 → 10:30
  
  Break 20m
- 10:30 → 11:35
  Session: DIRISA
  
  Convener: Mrs Ina Smith (Academy of Science of South Africa (ASSAf))
  - 10:40
    
    Update on NICIS 100Gbps Data Transfer Services – 1TB in 3mins! 15m
    
    Moving large amounts of data poses a significant challenge. In most cases, networks optimised for business operations are neither designed nor capable of meeting the data movement requirements of data-intensive research. When scientists attempt to run data intensive applications over these so-called “general-purpose” or enterprise networks, the result is often poor performance. In many cases, this poor performance significantly impacts the scientific mission, leading to challenges such as not receiving data on time or resorting to “desperate” measures, such as physically shipping disks.
    
    Available network capacity has increased significantly, and so has the need to transfer large amounts of data efficiently. The CSIR, through SANReN (part of NICIS), offers a Data Transfer Service to facilitate the movement of large datasets by South African researchers and scientists.
    
    With the implementation of the SANReN 100Gbps backbone network capacity, 100Gbps Data Transfer Nodes (DTNs) have been implemented in Cape Town and Johannesburg with Globus (globus.org) data transfer software installed. The best international data transfer results were seen between the Johannesburg/Cape Town DTNs and Colorado (NCAR GLADE). Data transfer speeds ranged from 4.78GB/s to 5.48GB/s, which resulted in 1TB of data being moved in only 3 minutes!
    
    If you have large data transfer requirements, please contact pert@sanren.ac.za for assistance.
    
    Speaker: Kasandra Pillay (SANReN)
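
As a rough check of the "1TB in 3 minutes" figure, the quoted rates can be converted into transfer times. This sketch assumes decimal units (1 TB = 1000 GB) and a rate sustained for the whole transfer; the function names are illustrative only.

```python
def transfer_minutes(size_tb, rate_gb_per_s):
    """Minutes needed to move size_tb terabytes at rate_gb_per_s gigabytes per second."""
    size_gb = size_tb * 1000  # decimal units: 1 TB = 1000 GB
    return size_gb / rate_gb_per_s / 60


def gbytes_to_gbits(rate_gb_per_s):
    """Convert gigabytes per second to gigabits per second (8 bits per byte)."""
    return rate_gb_per_s * 8


for rate in (4.78, 5.48):
    print(f"{rate} GB/s = {gbytes_to_gbits(rate):.1f} Gbps -> "
          f"1 TB in {transfer_minutes(1, rate):.2f} min")
# 4.78 GB/s = 38.2 Gbps -> 1 TB in 3.49 min
# 5.48 GB/s = 43.8 Gbps -> 1 TB in 3.04 min
```

Under these assumptions the quoted speeds correspond to roughly 38 to 44 Gbps on the 100Gbps nodes, and to between about 3.0 and 3.5 minutes per terabyte.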
- 11:35 → 12:30
  
  Lunch 55m
- 12:30 → 14:30
  Session: Technical
  
  Convener: Mrs Ina Smith (Academy of Science of South Africa (ASSAf))
  - 12:30
    
    Spatial Modelling of Irrigation Water Quality: Assessing SAR in South Africa's Agricultural Landscapes 20m
    
    The Sodium Adsorption Ratio (SAR) is a critical metric used to assess the suitability of water for agricultural irrigation, reflecting the potential for sodium to accumulate in soil and negatively affect crop yield and the ecosystem. In South Africa, agriculture is a cornerstone of economic development, contributing significantly to GDP and employment. Identifying geographical locations with poor SAR measures is essential for sustaining agricultural productivity and environmental health. In this study, a generalized additive model (GAM) was employed to analyse the spatial distribution of the SAR across South Africa. The model incorporated a spatial effect based on the geographical coordinates of the sample locations. This allowed for the investigation of how geographical factors influence the SAR in various regions across South Africa, while controlling for predictors such as other water quality parameters. The study made use of data from inorganic water chemistry analysis of samples from rivers, dams and lakes that were collected between 1970 and 2011 in South Africa. The significance of this research lies in its capacity to pinpoint locations with poor water quality, thereby guiding interventions aimed at soil and water management to avert potential degradation of arable land. The findings of this study not only aid in optimizing resource allocation for improving water quality but also contribute to the broader objectives of sustainable agricultural practices and economic stability in South Africa.
    
    Alongside this study, an interactive dashboard is under development for the monitoring and evaluation of water quality data in South Africa. The dashboard incorporates visualizations and important summary measures for SAR as well as various other water quality parameters. This tool democratizes access to vital information, enabling stakeholders to make informed decisions based on comprehensive water data analysis and visualizations. It not only enhances transparency and accountability but also facilitates a more targeted and efficient allocation of resources towards improving water quality initiatives in South Africa.
    
    Speakers: Dr Danielle Roberts (University of KwaZulu-Natal), Dr Xolani Nocanda (CSIR Water Centre)
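
The abstract does not state how SAR is computed. The sketch below uses the conventional definition of the sodium adsorption ratio, with concentrations expressed in milliequivalents per litre; the sample values are invented and are not taken from the study's dataset.

```python
import math


def sodium_adsorption_ratio(na, ca, mg):
    """SAR from sodium, calcium and magnesium concentrations, all in meq/L."""
    return na / math.sqrt((ca + mg) / 2)


# Hypothetical water sample: Na = 10, Ca = 4, Mg = 4 (meq/L)
print(sodium_adsorption_ratio(10.0, 4.0, 4.0))  # 10 / sqrt(4) = 5.0
```

Higher values indicate more sodium relative to calcium and magnesium, which is the condition the study associates with poorer irrigation water.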
  - 12:50
    
    AI model for securing Internet of Things communication systems in smart agriculture. 20m
    
    The rapid increase of Internet of Things (IoT) devices in smart agriculture has enabled a more connected and intelligent world. IoT devices are a collection of interconnected systems that can communicate and share data and information to achieve an automated environment. Smart agriculture presents a transformative approach to farming that leverages technology and data-driven solutions to address the challenges of modern agriculture, including the need to sustain a growing global population while minimising environmental impact and resource depletion. However, the increase in the deployment of IoT systems has led to an increase in cyber-attacks and security challenges. Moreover, security challenges such as man-in-the-middle, denial-of-service and distributed denial-of-service, botnet, sinkhole and spoofing attacks compromise the confidentiality, integrity and availability (CIA) of smart agriculture. This study investigates measures deployed for anomaly detection and prevention in IoT smart agriculture communication systems. Furthermore, the study proposes a model that incorporates machine learning techniques to identify and predict anomalies in IoT communication systems and adapt security measures dynamically. The proposed model is developed in Python and tested on accuracy, recall, F1-score, precision, true positive rate and false positive rate metrics. The IoT-based datasets CIC-IDS2018, ToN-IoT and Edge-IIoTset are used to evaluate the performance and efficiency of the proposed model.
    
    Speaker: Issah Ngomane (University of Mpumalanga)
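
The evaluation metrics named in the abstract all derive from the four counts of a binary confusion matrix. The following is a minimal sketch, assuming "attack" is the positive class; the counts are hypothetical and do not come from the datasets listed above.

```python
def classification_metrics(tp, fp, tn, fn):
    """Binary metrics from confusion-matrix counts (attack = positive class).

    Assumes every denominator is non-zero; a real evaluation would guard against that.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # identical to the true positive rate
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "false_positive_rate": fp / (fp + tn),
    }


# Hypothetical counts: 90 attacks caught, 20 missed, 10 false alarms, 880 normal flows.
for name, value in classification_metrics(tp=90, fp=10, tn=880, fn=20).items():
    print(f"{name}: {value:.3f}")
```

Reporting the false positive rate alongside recall matters for intrusion detection, because a detector that raises too many false alarms on normal traffic is hard to operate even when its accuracy looks high.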
  - 13:10
    
    The Significance of CoreTrustSeal Certification: A Case Study of Stellenbosch University 20m
    
    Abstract:
    Stellenbosch University’s (SU) recent achievement of CoreTrustSeal (CTS) certification represents a significant milestone in the institution's commitment to excellence in research data management. This certification, granted to its data repository, SUNScholarData, underscores the university's dedication to upholding international standards of data integrity, accessibility, and sustainability.
    CTS is a globally recognized certification awarded to data repositories that demonstrate adherence to best practices in data management (Dillo and Leeuw, 2018). For SU, this certification signifies compliance with international standards, including the FAIR principles, ensuring that research data is findable, accessible, interoperable, and reusable. Moreover, it highlights the university's commitment to maintaining data integrity and quality, thus enhancing the credibility of its research outputs.
    To attain CTS certification, SU had to meet a comprehensive set of requirements outlined by the CTS Board. These requirements encompassed data integrity and authenticity, appraisal criteria, documented storage procedures, preservation planning, data quality assurance, workflows, data discovery and identification, and data reuse capabilities.
    Beyond compliance, CTS certification reinforces SU’s commitment to long-term data accessibility and preservation. By securely storing and making research data accessible for future use, the university contributes to the longevity of scholarly contributions and facilitates collaboration across diverse research projects. Additionally, the certification promotes interoperability, enabling seamless data sharing and integration with other institutions.
    Stellenbosch University's attainment of CTS certification enhances the trustworthiness and credibility of its data repository among researchers, funders, and the broader academic community. It positions the university as a trusted hub for valuable research data and highlights its dedication to responsible data management practices. As Stellenbosch University continues to advance in research and innovation, CTS certification serves as a guiding beacon towards a future where data is managed, preserved, and shared responsibly for the benefit of the global research community.
    The aim of this presentation will be to share the experiences of SU Library and Information Service as it went through the rigorous process of applying for the CTS certification. Both presenters have been intimately involved in the process from the Cape Peninsula University of Technology (which is yet to receive a certification) and SU. It is hoped that the experiences shared may trigger more interest from other institutional repositories to follow the same process.
    
    References:
    DILLO, I. & LEEUW, L. D. 2018. CoreTrustSeal. Mitteilungen der Vereinigung Österreichischer Bibliothekarinnen & Bibliothekare, 71, 162-170.
    
    Speaker: Mr Xabiso Xesi (Stellenbosch University)
  - 13:30
    
    Data Management in National Assessments: A Look towards the Future 20m
    
    National assessments play a critical role in evaluating educational systems and informing policy decisions. However, the effectiveness of these assessments hinges on robust data management practices. This article delves into the evolving landscape of data management in national assessments, exploring how research data repositories and services can contribute to a more secure, accessible, and sustainable future.
    The paper highlights key challenges in national assessment data management, including data security, long-term preservation, and facilitating data sharing for research and improvement. It will then showcase how research data repositories can address these challenges by providing secure storage, standardized formats, and access controls. Additionally, the article will explore how advancements in data services, such as data cleaning, harmonization, and analysis tools, can further empower researchers and policymakers to leverage national assessment data effectively. By fostering collaboration between assessment developers, researchers, and data repositories, the research ensures that national assessment data becomes a valuable resource for generations to come. This article contributes to literature on data management by exploring how research data infrastructure plays a vital role in propelling national assessments towards a data-driven future.
    
    Speaker: Dr Lucy Tambudzai Chamba (Durban University of Technology)
  - 13:50
    
    A Robust Data Visualisation Technique for Data-Driven Decisions: Illustrations from Power Demand and Supply 20m
    
    Ongoing technological advances in computing and data acquisition, together with the complex interactions of the Sustainable Development Goals (SDGs), create a natural Big Data environment for researchers and decision makers across fields and sectors to tap into. Identifying the triggers of SDG targets under such conditions is non-trivial, not only because of the large data dimensionality and non-orthogonal nature of the SDGs, but also due to the naturally arising data- and information-related gaps between data analysts and policy makers. We propose a cohesive data visualisation approach to bridging such gaps. The approach’s main idea derives from a cohesion between technical and non-technical data generators and consumers. It is designed to provide a visual conduit between the two parties, hence facilitating unified understanding of the role and impact of the visualised data attributes. Visualisation of SDG-related data provides a natural interdisciplinary setting for stakeholders to gain actionable insights into important patterns across the SDG spectrum. Identification of relevant data attributes and the nature of their complex interactions are fundamental to creating robust data-driven solutions. Thus, given real-time access to the visual effects of a single SDG or a set of SDGs, decision-makers can quickly grasp the significance of key attributes and make informed strategic choices before it is too late. Most importantly, the cohesive approach potentially leads to a unified understanding of the SDG project across the globe, regions and within countries. Communicating information embedded into data attributes via interactive data visualisation is pivotal in optimising operational efficiency. It enables timely identification of bottlenecks, tracking performance metrics and making timely interventions.
    We illustrate the approach based on a large time-series dataset obtained from the South African utility giant, ESKOM (https://www.eskom.co.za/dataportal/), covering the period 01 April 2020 to 31 March 2024. The choice is motivated by ESKOM’s quest for stabilisation of the national electricity grid by balancing supply with the demand for electricity, which can realistically be validated through visualisation and assessing demand forecasts. Visual patterns and forecasts show areas of attention and potential associations with other aspects of SDGs, highlight paths to a unified understanding of the triggers of SDG indicators between data technocrats and decision makers, and open new paths to interdisciplinary research.
    
    Speaker: Dr Kassim Mwitondi (Sheffield Hallam University)
  - 14:10
    
    National Policy Data Observatory (NPDO) 20m
    
    The National Policy Data Observatory (NPDO) is a government-led initiative, currently hosted at the CSIR, with the main objective to support data-driven decision making in government on various socio-economic interventions. The NPDO was initiated at the height of the Covid-19 pandemic by the DSI, supported by the CSIR, Statistics SA and South African Revenue Service. The NPDO leverages the CSIR’s high-speed networking infrastructure (SANReN) and high-performance computing infrastructure (CHPC) under the National Integrated Cyber Infrastructure System.
    
    Speaker: Mr Ross Holder (CSIR)
- 14:30 → 14:50
  
  Break 20m
- 14:50 → 16:15
  Session: Technical
  
  Convener: Mrs Ina Smith (Academy of Science of South Africa (ASSAf))
  - 14:50
    
    The Future of Research Data: Open Science, FAIR Principles, and Effective Governance 20m
    
    In the realm of research, effective data governance plays a pivotal role in ensuring the integrity, accessibility, and usability of research data. We delve into the significance, methodologies, and complexities involved in establishing robust data governance frameworks tailored specifically for research data.
    Research data governance encompasses the policies, processes, and infrastructure which facilitate data management, sharing, and reuse while adhering to ethical, legal, and regulatory requirements. This abstract elucidates key components of research data governance, including data management plans, data stewardship, metadata standards, and data sharing protocols.
    Moreover, it addresses the unique challenges encountered in governing research data, such as heterogeneous data formats, disciplinary differences, and evolving data management practices. Strategies for overcoming these challenges, such as community-driven standards development, interoperable data repositories, and data curation services, are explored.
    Furthermore, it discusses emerging trends and technologies shaping the landscape of research data governance, such as open science initiatives, FAIR principles (Findable, Accessible, Interoperable, Reusable), and machine-readable data policies. It emphasises the need for collaborative efforts among researchers, institutions, funding agencies, and policymakers to foster a culture of responsible data stewardship and data sharing.
    
    Speaker: Mr Thapelo Maredi
  - 15:30
    
    Data mining, management and modelling in advancing water and sanitation systems 20m
    
    Data mining and management play an important role in advancing water and sanitation systems, ensuring the sustainable delivery of essential services. The application of data mining encompasses the collection, processing, and analysis of vast datasets derived from various sources such as sensor networks, satellite imagery, and public health records. These techniques facilitate the identification of patterns, trends, and anomalies, which are important for informed decision-making and strategic planning.
    
    By leveraging predictive analytics, water management authorities can anticipate demand fluctuations, optimise resource allocation, and enhance the efficiency of distribution networks. Similarly, in sanitation, data mining assists in monitoring system performance, detecting potential failures, and mitigating health risks by providing early warnings of contamination events. Moreover, the adoption of robust data management frameworks ensures the integration, storage, and accessibility of diverse datasets, supporting real-time monitoring and long-term strategic initiatives. Challenges such as data privacy, accuracy, and the need for interdisciplinary collaboration need to be addressed to ensure the reliability and efficacy of these systems. The convergence of data mining and management in water and sanitation sectors holds significant promise for enhancing operational efficiency, ensuring resource sustainability, and safeguarding public health.
    
    Data integration poses a significant hurdle due to varying formats and structures across different sources. The main challenge is data quality and accuracy with issues like missing values and outliers. Water databases may contain diverse types of data, including spatial, temporal, and multi-dimensional information. Integrating and reconciling these different types of data can be challenging, especially when they come from various sources with distinct formats and structures.
    
    Machine learning is a rapidly expanding field of computer science with diverse applications. Understanding seasonal rainfall changes is crucial for both academic and societal objectives. Current data on surface and groundwater, including water quality and quantity was reviewed. Collecting real-time data, using automated sensors, and integrating remote sensing technologies helped understand water quality dynamics. Data from the Geographical Information System (GIS) was collected to better understand the spatial distribution of water quality and quantity factors. Integrating GIS data enhances our understanding of water resources.
    
    After collection, the data was cleaned for accuracy and dependability. Analysis of the water datasets showed that the random forest (RF) method outperforms all the other water quality classification models tested in this project, with a good combination of precision and recall across both classes. Although support vector machines are good at identifying negative classes, they struggle with positive ones. Logistic regression has limits when separating water quality groups. Although decision tree models have balanced performance, there is still potential for development. When estimating water volume, both linear regression and RF models do moderately well, but the latter struggles to capture the underlying patterns. A negative R² score for the RF model implies a lack of substantial predictive potential, necessitating additional research or evaluation of alternative models.
    
    Speaker: Dr Ridhwaan Suliman (CSIR)
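
A negative R² score, as reported for the RF volume model, means the predictions are worse than simply predicting the mean of the observed values. The sketch below shows this with the usual definition of the coefficient of determination; the function is hand-written for illustration and the numbers are invented, not project data.

```python
def r2_score(actual, predicted):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


actual = [1.0, 2.0, 3.0, 4.0]
print(r2_score(actual, [1.0, 2.0, 3.0, 4.0]))  # perfect predictions: 1.0
print(r2_score(actual, [2.5, 2.5, 2.5, 2.5]))  # always predicting the mean: 0.0
print(r2_score(actual, [4.0, 3.0, 2.0, 1.0]))  # worse than the mean: -3.0
```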
  - 15:50
    
    Big Data Analytics for Intelligent Temperature Prediction in CNC Machining using PCA and AI 20m
    
    Abstract. Temperature prediction is crucial in CNC machining to prevent overheating, tool damage, and poor surface finish quality. This study presents a big data analytics framework for intelligent temperature prediction in CNC machining using Principal Component Analysis (PCA) and Artificial Intelligence (AI). The following methodology was followed during the CNC machining of EN18 steel: data collection from laboratory CNC machining; normalization of the collected data to ensure consistency and comparability; application of PCA to the preprocessed data to reduce dimensionality; training of AI models (ANN, ANFIS and Random Forest) using PCA-extracted features and temperature data; performance evaluation of the AI models using mean absolute error and coefficient of determination; utilization of the trained AI model to predict temperature values for new, unseen data; and comparison of the AI model results to those of the traditional linear regression model. The proposed approach predicts temperature with an accuracy above 95%. The results show improved prediction performance compared to the traditional linear regression method, demonstrating the effectiveness of intelligent big data analytics and AI in CNC machining. This research contributes to the development of Industry 4.0 technologies, enhancing manufacturing efficiency, productivity, and product quality.
    Keywords: Big Data analytics, Principal Component Analysis (PCA), Artificial Intelligence (AI), Temperature Prediction.
    
    Speakers: Zvikomborero Hweju (Lecturer, Chinhoyi University of Technology, Zimbabwe), Ms Varaidzo Dandira-Chibaya (Lecturer, Chinhoyi University of Technology)
  - 16:10
    
    Wrap Up Day 1 5m
    
    Wrap Up Day 1 by Mrs Ina Smith, Programme Director
    
    Speaker: Ina Smith (Academy of Science of South Africa (ASSAf))
Wednesday, 3 July
- 07:00 → 08:00
  
  Arrival, Registration and Breakfast 1h
- 08:00 → 09:45
  Session: Technical
  
  Convener: Mrs Ina Smith (Academy of Science of South Africa (ASSAf))
  - 08:00
    
    Welcome 5m
    
    Welcome by Programme Director, Mrs Ina Smith, ASSAf Planning Manager
  - 08:05
    
    Machine-learning algorithms for mapping LULC of the uMngeni catchment area, KwaZulu-Natal 20m
    
    Abstract: Analysis of land use/land cover (LULC) in catchment areas is the first action toward safeguarding freshwater resources. LULC information for watersheds has gained popularity in the natural science field as it helps water resource managers and environmental health specialists develop natural resource conservation strategies based on available quantitative information. Thus, remote sensing is the cornerstone in addressing environmental-related issues at the catchment level. In this study, the performance of four machine learning algorithms (MLAs), namely Random Forests (RF), Support Vector Machine (SVM), Artificial Neural Networks (ANN), and Naïve Bayes (NB), was investigated to classify the catchment into nine relevant classes of the undulating watershed landscape using Landsat 8 Operational Land Imager (L8-OLI) imagery. The assessment of the MLAs was based on the visual inspection of the analyst and the commonly used assessment metrics, such as user’s accuracy (UA), producer’s accuracy (PA), overall accuracy (OA), and kappa coefficient. The MLAs produced good results: RF (OA = 97.02%, Kappa = 0.96), SVM (OA = 89.74%, Kappa = 0.88), ANN (OA = 87%, Kappa = 0.86), and NB (OA = 68.64%, Kappa = 0.58). The results show the outstanding performance of the RF model over SVM and ANN with a small margin. NB yielded only satisfactory results, which could be primarily influenced by its sensitivity to limited training samples. In contrast, the robust performance of RF could be due to an ability to classify high-dimensional data with limited training data.
    Keywords: uMngeni River Catchment; Machine learning; LULC; Landsat 8; Remote sensing
    
    Speaker: Orlando Bhungeni (University of KwaZulu-Natal)
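
Overall accuracy and the kappa coefficient quoted in the abstract are both computed from a classification confusion matrix. The sketch below uses a hypothetical two-class matrix rather than the study's nine-class results, and the function name is illustrative.

```python
def overall_accuracy_and_kappa(matrix):
    """Return (overall accuracy, Cohen's kappa) for a square confusion matrix.

    matrix[i][j] is the number of samples of reference class i assigned to class j.
    """
    n = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(len(matrix))) / n
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Agreement expected by chance, from the row and column marginals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical matrix: 100 validation pixels, two classes.
oa, kappa = overall_accuracy_and_kappa([[45, 5], [10, 40]])
print(f"OA = {oa:.2%}, Kappa = {kappa:.2f}")  # OA = 85.00%, Kappa = 0.70
```

Kappa is lower than overall accuracy because it discounts the agreement that would be expected by chance, which is why the abstract reports both.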
  - 08:25
    
    Exploring Technical Challenges in Data Science Fundamentals with Python for First-year Students at a Rural South African University 20m
    
    Abstract. Most Eastern Cape rural schools operate in a disadvantaged context and are struggling to raise standards. In many rural areas, access to quality education and access to devices or other resources in information technology and computer science may be limited, making such an initiative particularly impactful.
    The current state of high schools has a huge impact on students going to universities. For example, the majority of first-year Information Technology Diploma students at the selected university come from rural schools in the Eastern Cape (EC). Introducing data science with Python to first-year students at the selected university presents an exceptional opportunity to equip students with essential skills for the digital age while addressing the specific challenges of the local context.
    Data science is an interdisciplinary field that integrates statistical analysis, machine learning, and domain expertise in order to extract insights from data. It is becoming more and more significant in a variety of global sectors. The university can help first-year students develop the critical thinking and problem-solving abilities necessary for success in the twenty-first century, as well as prepare them for future employment. Python, a useful and beginner-friendly programming language, is well suited to introducing data science concepts to beginner students, which makes it an ideal choice for introductory courses. Moreover, Python's popularity in both industry and academia ensures that students will acquire skills relevant to their future careers. Practical, hands-on exercises can be incorporated into the course to address the unique requirements and challenges faced by rural students. To guarantee that every student has an equal chance to succeed, the institution can also offer support services like mentoring, tutoring, and access to computer labs and internet resources.
    
    Introducing data science as supplementary content is challenging for disadvantaged first-year students who lack the course background and access to devices, and it can result in confusion, anxiety and frustration.
    Many of the students entering the university lack basic computer and digital skills and have no access to devices, and they must also cope with English as the medium of instruction used in programming. The paper focuses on some of the best approaches, support tools and resources for assisting disadvantaged students, and we reflect on how they have worked out for any given computer programming problem-solving task.
    
    Keywords: Data Science, Programming, Digital Skills, Information Technology
    
    Speaker: Ms Sibukele Gumbo (Walter Sisulu University)
  - 09:25
    
    Investigating the Prospects of Blockchain Technology in an Electoral System 20m
    
    Voting is part of our fundamental rights. It is crucial to have open and fair elections. To take part in the election process, voters need to have the utmost trust and confidence in that process and its outcome. When the underlying trust, confidence, and respect for the election are eroded by news of vote manipulation and disregard of processes by the officials overseeing the election, voters lose interest in voting because they believe the elections are rigged and that participating in such an important exercise serves no purpose. Current voting systems, which are often plagued by problems such as voter fraud, ballot tampering, and lack of transparency, necessitate a robust, secure, and transparent alternative. This research attempts to address some of the issues relating to having an unsecured voting system. The focus of the research is to investigate the prospects of implementing a blockchain-based electoral system. Implementing blockchain technology, a decentralized ledger technology characterized by its immutability, transparency, and security, might address some of the security challenges of current electoral systems. Furthermore, the study examines the principles of blockchain technology and evaluates its application in various stages of the electoral process.
- 09:45 → 10:15
  
  Break 30m
- 10:15 → 11:55
  Session: Technical
  
  Convener: Mrs Ina Smith (Academy of Science of South Africa (ASSAf))
  - 10:35
    
    Securing Identity using Biometrics and Zero Knowledge Proofs 20m
    
    Data protection and cybersecurity are distinct concepts, but they complement each other. Data protection ensures data integrity, while cybersecurity protects the digital ecosystem from threats like cyberattacks and malware. In an interconnected digital world, identity is increasingly stored, shared, and processed as data, including biometrics and Personally Identifiable Information. To address these challenges, a method using Zero Knowledge Proofs (ZKPs) is proposed to protect biometric information and secure identities. ZKPs allow one party to prove a statement is true without revealing additional information, ensuring confidentiality and security without exposing the biometric information. This approach not only enhances biometric data security but also addresses privacy concerns associated with storing and transmitting sensitive information.
    
    Speakers: Mrs Sthembile Ntshangase (Researcher), Ms Siphelele Myaka (Cybersecurity Researcher)
  - 10:55
    
    Unpacking the role of information specialist at a 21st century academic library: Research Data Management at the University of Pretoria. 20m
    
    Research data management (RDM) is rapidly becoming an essential service. This has forced higher education institutions (HEIs), research councils, publishers and funding agencies to embark on this journey. In February 2015, in alignment with this global trend, the South African National Research Foundation (NRF) released a statement on Open Access (OA) to research publications funded by the NRF. The statement states that research papers fully or partially funded by the NRF should be deposited in the administering institutional repository with an embargo period of not more than 12 months. Furthermore, the statement states that “the data supporting the publication should be deposited in a trusted Open Access repository, with the provision of a Digital Object Identifier (DOI) for future citation and referencing.” In support of the NRF statement, the University of Pretoria (UP), as a research-intensive institution and advocate for OA, had an RDM policy (S4417/17) approved in 2017. In its pursuit of implementing RDM infrastructure and services, the university launched its research data repository, Figshare, in 2019. After the launch, the roles of information specialists changed in order to support RDM services. The information specialists did not know what their role in RDM would be, and hence the researcher undertook this study. The research unpacks their role as information specialists in RDM. The University library, as the custodian of UP’s formal RDM drive, must set the pace for all researchers. Information specialists employed by this academic library are expected to be knowledgeable, competent and able to advise faculty staff and researchers within the institution.
    
    Speaker: Mr Tlou Mathiba (University of Pretoria)
  - 11:35
    
    Machine Learning in HIV Testing: A Bibliometric Analysis of Published Studies 2000-2024 20m
    
    Background: The human immunodeficiency virus (HIV) remains a devastating public health threat, affecting 39 million people globally, with approximately 60% of these cases occurring in Sub-Saharan Africa. Early detection and diagnosis of HIV are crucial for preventing the further spread of the virus, making HIV testing a pivotal tool for achieving the UNAIDS goal of ending AIDS by 2030. The World Health Organization and UNAIDS have emphasized the importance of adopting innovative testing strategies, such as those involving machine learning. Machine learning can accurately predict high-risk individuals and facilitate more effective and efficient testing methods compared to traditional approaches. Despite this advancement, there exists a knowledge gap regarding the extent to which machine learning techniques are integrated into HIV testing strategies worldwide. To address this gap, this study aimed to analyze published studies that applied machine learning to HIV testing from 2000 to 2024.
    Methods: This study utilized a bibliometric approach to analyze studies that focused on the use of machine learning in HIV testing. Relevant studies were captured through the Web of Science database using synonymous keywords. The bibliometrix package in R was used to analyze the characteristics, citation patterns, and contents of 3962 articles, while VOSviewer was used to produce network visualizations. The analysis focused on the yearly growth rate, citation analysis, keywords, institutions, countries, authorship, and collaboration patterns.
    Results: The analysis revealed an annual scientific growth rate of 8.8%, with international co-authorship of 44.7% and an average of 23.16 citations per document. The most relevant sources were high-impact journals such as PLOS ONE, AIDS and Behavior, Journal of Acquired Immune Deficiency Syndromes, Journal of the International AIDS Society, AIDS, and BMC Public Health. The USA, the United Kingdom, South Africa, China, and Canada produced the highest number of contributions. The results show that the University of California, Johns Hopkins University, Harvard University, and the University of London have the highest collaboration networks.
    Conclusion: This study identifies trends and hotspots of machine learning research related to HIV testing across various countries, institutions, journals, and authors. These insights are crucial for future researchers to understand the dynamics of research outputs in this field.
    
    Speaker: Mr Musa Jaiteh (University of Johannesburg)
- 11:55 → 12:35
  
  Lunch 40m
- 12:35 → 14:35
  Session: Technical
  
  Convener: Mrs Ina Smith (Academy of Science of South Africa (ASSAf))
  - 12:35
    
    Context-Based Question Answering using Large Language BERT Variant Models for Low Resourced Sotho sa Leboa Language. 20m
    
    Since reading and responding to text requires both a grasp of natural language and awareness of the outside world, it is challenging for machines to do (Akhila et al., 2023). Question answering systems (QAS) are among the most difficult areas of information retrieval and natural language processing. The goal of a question answering system is to use the provided context or knowledge base to reply in natural language to the user's questions. Answers can be produced in both closed and open domains. A closed-domain system's responses are limited to a specific situation, whereas open-domain systems are able to provide answers in human-readable language from a vast knowledge base. Another issue is coming up with answers to questions based on certain situations, as each question might have a variety of interpretations and responses depending on the context to which it relates (Kumari et al., 2022). To our knowledge, this research work is the first effort to extract answers from a context in the low-resourced Sesotho sa Leboa language. Bidirectional Encoder Representations from Transformers (BERT) variant models, such as ALBERT and DistilBERT, are used as the language models in this research study.
    
    Speaker: Hlaudi Masethe (Tshwane University of Technology)
  - 13:15
    
    Big data and machine learning skills, experience, methods, and big data usage. Empirical evidence from manufacturing firms in Zimbabwe. 20m
    
    The use of big data and machine learning in decision-making processes in manufacturing industries is gaining momentum as manufacturers seek to enhance production performance and competitiveness. Big data technologies have transformed manufacturing decision-making, resulting in data-driven approaches. To compete in today's dynamic market, manufacturing organizations must adapt and evolve, which necessitates the effective use of data for forecasting future events and making decisions. Advanced analytics applied to large datasets allows firms to acquire deeper insights, spot patterns, forecast future trends, and optimize processes. Limitations of conventional data processing methods create new opportunities for development and innovation. However, in developing countries such as Zimbabwe, the benefits of using big data are not fully realized in manufacturing industries. Three essential requirements must be met to use big data effectively: big data skills, experience using big data, and effective data processing methods. The study focused on assessing how these three factors influence the utilization of big data in the manufacturing industry. Understanding data skills, experience and processing methods is essential in building big data management skills and improving adoption in manufacturing firms. A preliminary survey was conducted on 36 manufacturing companies in Zimbabwe. The data was analyzed using SPSS version 27. The results showed that only 16.7% of the companies were effectively using big data. The effectiveness of data processing methods, big data skills, and experience using big data significantly affect the utilization of big data in manufacturing (p<0.05). In addition, the effectiveness of data processing methods has a positive impact on the quality of decisions and the level of prediction accuracy. The study therefore recommends that manufacturers train their employees in the required skills and upgrade their data processing methods.
    In the future, to address the limitations of this study, there is a need to increase the sample size to produce more generalizable results.
    Keywords: big data, machine learning, data processing, data experience.
    
    Speakers: Mrs Sostina Varaidzo Chibaya (Liaoning Technical University, China), Dr More Chinakidzwa (Higher Colleges of Technology, UAE)
  - 13:35
    
    Enhancing Climate Services in South Africa 20m
    
    Climate services play a critical role in the country. They are a pertinent factor in decision-making at various levels for almost all sectors and communities in South Africa. Unfortunately, the people who need climate information most cannot always obtain it, or find it inaccessible when it does exist.
    Changes in climate, both human-caused and natural, have a major impact on society, affecting areas such as the economy, water and food security, and overall health and well-being. South Africa has experienced noticeable changes, including rising average temperatures, increased frequency of extreme heat events, prolonged droughts, and intensified floods, all of which underscore the urgency of addressing climate-related challenges. Collaborative efforts by different climate service providers through the National Framework for Climate Services (NFCS) will have to be strengthened to deliver climate services to all users. The NFCS aims to develop a practical model that acknowledges the significance of emerging trends in producing climate service data, one that values the consultation and involvement of climate users and the vital role users play in collaborating on climate service information. The ideas of collaborating in production and exploration are acknowledged as essential for the effective utilization of climate data in decision-making. This paper offers an overview of the current status of the NFCS implementation within South Africa.
    
    Speaker: Dr Dawn Mahlobo
- 14:35 → 14:55
  
  Break 20m
- 14:55 → 16:15
  
  Session: Technical
  
  Convener: Mrs Ina Smith (Academy of Science of South Africa (ASSAf))