Development of robust language models for speech recognition of under-resourced language

Sindana, Daniel

dc.contributor.advisor	Manamela, M. J. D.
dc.contributor.author	Sindana, Daniel
dc.contributor.other	Modipa, T. I.
dc.date.accessioned	2021-07-29T08:50:55Z
dc.date.available	2021-07-29T08:50:55Z
dc.date.issued	2020
dc.identifier.uri	http://hdl.handle.net/10386/3413
dc.description	Thesis (M.Sc. (Computer Science)) -- University of Limpopo, 2020	en_US
dc.description.abstract	Language modelling (LM) work for under-resourced languages that does not consider most of the linguistic information inherent in a language produces language models that inadequately represent the language, thereby leading to under-development of natural language processing tools and systems such as speech recognition systems. This study investigated the influence that the orthography (i.e., writing system) of a language has on the quality and/or robustness of the language models created for the text of that language. The unique conjunctive and disjunctive writing systems of isiNdebele (Ndebele) and Sepedi (Pedi) were studied. The text data from the LWAZI and NCHLT speech corpora were used to develop language models. The LM techniques that were implemented included: word-based n-gram LM, LM smoothing, LM linear interpolation, and higher-order n-gram LM. The toolkits used for development were the HTK LM, SRILM, and CMU-Cam SLM toolkits. From the findings of the study (on text preparation, data pooling and sizing, higher n-gram models, and interpolation of models), it is concluded that the orthography of the selected languages does have an effect on the quality of the language models created for their text. The following recommendations are made as part of LM development for the concerned languages. 1) Special preparation and normalisation of the text data before LM development, paying attention to within-sentence text markers and annotation tags that may incorrectly form part of sentences, word sequences, and n-gram contexts. 2) Enable interpolation during training. 3) Develop pentagram and hexagram language models for Pedi texts, and trigrams and quadrigrams for Ndebele texts. 4) Investigate efficient smoothing methods for the different languages, especially for different text sizes and different text domains.	en_US
dc.description.sponsorship	National Research Foundation (NRF); Telkom; University of Limpopo	en_US
dc.format.extent	x, 97 leaves	en_US
dc.language.iso	en	en_US
dc.relation.requires	PDF	en_US
dc.subject	Language modelling	en_US
dc.subject	Natural language processing	en_US
dc.subject	Automatic speech recognition	en_US
dc.subject	Under-resourced languages	en_US
dc.subject.lcsh	Robust control	en_US
dc.subject.lcsh	Automatic speech recognition	en_US
dc.subject.lcsh	Speech perception	en_US
dc.title	Development of robust language models for speech recognition of under-resourced language	en_US
dc.type	Thesis	en_US