Speech data collection in an under-resourced language within a multilingual context

Authors: Molapo, B; Barnard, E; De Wet, Febe
Date issued: 2014-05
Record date: 2014-08-25
Type: Conference Presentation
Language: en
URI: http://hdl.handle.net/10204/7621
Repository: ResearchSpace, CSIR (https://researchspace.csir.co.za)

Citation: Molapo, R and Barnard, E and De Wet, F. 2014. Speech data collection in an under-resourced language within a multilingual context. In: 4th International Workshop on Spoken Language Technologies for Under-resourced Languages, St Petersburg, Russia, 14-16 May 2014

Abstract: In this paper, we present an end-to-end solution to the development of an automatic speech recognition (ASR) system for typical under-resourced languages, where the target language is likely to be influenced by one or more embedded foreign languages. We first describe the collection and processing of a text corpus crawled from the World Wide Web using the Rapid Language Adaptation Toolkit. In particular, we highlight the challenges faced when foreign languages are embedded within the matrix language. Thereafter, we discuss our speech data collection efforts in under-resourced environments. Finally, we report on a strategy called transliteration, which helps to improve the recognition results of our grapheme-based ASR system in the presence of embedded-language words.

Keywords: Under-resourced languages; Transliteration; Matrix language; Grapheme-based ASR