1-3 December 2021

Optimised Code-Switched Language Model Data Augmentation in Four Under-Resourced South African Languages

Not scheduled
20m
Student Micro-talk, Cognitive Computing and Machine Learning Micro-talks

Speaker

Joshua Jansen van Vueren (Stellenbosch University)

Description

Code-switching is common in South African languages, but data for language modelling remains extremely scarce. We present techniques that allow recurrent neural networks (LSTMs) to be applied more effectively as generative models for producing artificial code-switched text with which to augment these small training sets. We propose the application of prompting to favour the generation of sentences with intra-sentential language switches, and introduce an extensive LSTM hyperparameter search that specifically optimises the utility of the artificially generated code-switched text. We use these strategies to generate artificial code-switched text for four under-resourced South African languages and evaluate the utility of this additional data for language modelling. We find that the optimised models generate text that leads to consistent perplexity and word error rate improvements for all four language pairs, especially at language switches. We conclude that prompting and targeted hyperparameter optimisation are effective means of improving language model data augmentation for code-switched speech recognition.
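
The abstract gives no implementation detail, so the following is a minimal sketch of the prompting idea only, assuming a PyTorch LSTM language model; the class, dimensions, token ids and prompt below are hypothetical illustrations, not the authors' code. The sketch seeds the model with a short prompt and then samples a continuation token by token; a prompt that ends in embedded-language tokens biases the continuation towards an intra-sentential language switch.

    # Minimal sketch (not the authors' code): prompted sampling from an LSTM
    # language model in PyTorch. Names, dimensions and token ids are
    # hypothetical illustrations.
    import torch
    import torch.nn as nn

    class LSTMLanguageModel(nn.Module):
        def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.proj = nn.Linear(hidden_dim, vocab_size)

        def forward(self, tokens, state=None):
            out, state = self.lstm(self.embed(tokens), state)
            return self.proj(out), state

    def generate(model, prompt_ids, max_len=20, temperature=1.0):
        # Feed the prompt through the model, then sample one token at a time.
        model.eval()
        tokens = list(prompt_ids)
        with torch.no_grad():
            logits, state = model(torch.tensor([tokens]))
            for _ in range(max_len):
                probs = torch.softmax(logits[0, -1] / temperature, dim=-1)
                nxt = torch.multinomial(probs, 1).item()
                tokens.append(nxt)
                logits, state = model(torch.tensor([[nxt]]), state)
        return tokens

    vocab_size = 1000                      # toy vocabulary size
    model = LSTMLanguageModel(vocab_size)  # in practice: trained on the code-switched training set
    # A prompt ending in embedded-language tokens encourages the sampled
    # continuation to contain an intra-sentential switch.
    switch_prompt = [5, 42, 7]             # hypothetical token ids
    print(generate(model, switch_prompt, max_len=15))

Following the abstract, the hyperparameters of such a generator (layer sizes, sampling temperature, and so on) would be selected by the utility of the generated text for language modelling, for example development-set perplexity after augmentation, rather than by the generator's own training loss.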

Primary author

Joshua Jansen van Vueren (Stellenbosch University)

Co-author

Thomas Niesler (Stellenbosch University)

Presentation Materials

There are no materials yet.