Author: Louw, Johannes A
Date accessioned: 2021-04-23
Date available: 2021-04-23
Date issued: 2020-12
Citation: Louw, J.A. 2020. Text-to-speech duration models for resource-scarce languages in neural architectures. <i>Communications in Computer and Information Science, 1342.</i> http://hdl.handle.net/10204/11999
ISSN: 1865-0929
DOI: https://doi.org/10.1007/978-3-030-66151-9_9
URI: http://hdl.handle.net/10204/11999

Abstract: Sequence-to-sequence end-to-end models for text-to-speech have shown significant gains in the naturalness of the produced synthetic speech. These models have an encoder-decoder architecture without an explicit duration model, relying instead on a learned attention-based alignment mechanism, which simplifies the training procedure and reduces the language expertise required for building synthetic voices. There are, however, drawbacks: attention-based alignment systems such as those used in the Tacotron, Tacotron 2, Char2Wav and DC-TTS end-to-end architectures typically suffer from low training efficiency as well as model instability, and several approaches have been attempted to address these problems. Recent neural acoustic models have moved away from using an attention-based mechanism to align the linguistic and acoustic encoding and decoding, and have instead reverted to using an explicit duration model for the alignment. In this work we develop an efficient neural network based duration model and compare it to the traditional Gaussian mixture model based architectures used in hidden Markov model (HMM)-based speech synthesis. We show through objective results that our proposed model is better suited to resource-scarce language settings than the traditional HMM-based models.

Full text availability: Fulltext
Language: en
Keywords: Hidden Markov Model; HMM; Speech synthesis; Duration modelling; Resource-scarce languages
Title: Text-to-speech duration models for resource-scarce languages in neural architectures
Type: Article

Citation (APA): Louw, J. A. (2020). Text-to-speech duration models for resource-scarce languages in neural architectures. <i>Communications in Computer and Information Science, 1342</i>. http://hdl.handle.net/10204/11999
Citation (Chicago): Louw, Johannes A. "Text-to-speech duration models for resource-scarce languages in neural architectures." <i>Communications in Computer and Information Science, 1342</i> (2020). http://hdl.handle.net/10204/11999
Citation (Vancouver): Louw JA. Text-to-speech duration models for resource-scarce languages in neural architectures. Communications in Computer and Information Science, 1342. 2020; http://hdl.handle.net/10204/11999.

RIS export:
TY - Article
AU - Louw, Johannes A
AB - Sequence-to-sequence end-to-end models for text-to-speech have shown significant gains in the naturalness of the produced synthetic speech. These models have an encoder-decoder architecture without an explicit duration model, relying instead on a learned attention-based alignment mechanism, which simplifies the training procedure and reduces the language expertise required for building synthetic voices. There are, however, drawbacks: attention-based alignment systems such as those used in the Tacotron, Tacotron 2, Char2Wav and DC-TTS end-to-end architectures typically suffer from low training efficiency as well as model instability, and several approaches have been attempted to address these problems. Recent neural acoustic models have moved away from using an attention-based mechanism to align the linguistic and acoustic encoding and decoding, and have instead reverted to using an explicit duration model for the alignment. In this work we develop an efficient neural network based duration model and compare it to the traditional Gaussian mixture model based architectures used in hidden Markov model (HMM)-based speech synthesis. We show through objective results that our proposed model is better suited to resource-scarce language settings than the traditional HMM-based models.
DA - 2020-12
DB - ResearchSpace
DP - CSIR
J1 - Communications in Computer and Information Science, 1342
KW - Hidden Markov Model
KW - HMM
KW - Speech synthesis
KW - Duration modelling
KW - Resource-scarce languages
LK - https://researchspace.csir.co.za
PY - 2020
SM - 1865-0929
T1 - Text-to-speech duration models for resource-scarce languages in neural architectures
TI - Text-to-speech duration models for resource-scarce languages in neural architectures
UR - http://hdl.handle.net/10204/11999
ER -
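Note: for readers unfamiliar with explicit duration modelling in text-to-speech, the following is a minimal, hypothetical sketch of the kind of neural duration predictor the abstract refers to: a small feedforward network that regresses log phone durations from per-phone linguistic feature vectors, taking the role that Gaussian mixture / decision-tree duration models play in HMM-based synthesis. This is not the architecture proposed in the paper; the feature dimension, layer sizes, loss and data are illustrative assumptions only.

# Minimal sketch (assumed, not the paper's model): neural phone-duration predictor.
import torch
import torch.nn as nn

class DurationModel(nn.Module):
    def __init__(self, feat_dim=300, hidden_dim=256):
        super().__init__()
        # Feedforward regressor from linguistic features to a single log duration.
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # predicted log duration (in acoustic frames)
        )

    def forward(self, linguistic_feats):
        # linguistic_feats: (batch, num_phones, feat_dim)
        return self.net(linguistic_feats).squeeze(-1)  # (batch, num_phones)

# Training would regress against forced-alignment durations; placeholder data here.
model = DurationModel()
feats = torch.randn(8, 50, 300)            # 8 utterances, 50 phones, 300-dim features (assumed)
target_log_dur = torch.rand(8, 50) * 3.0   # placeholder log-duration targets
loss = nn.functional.mse_loss(model(feats), target_log_dur)
loss.backward()

The design choice illustrated is the one the abstract contrasts with attention-based alignment: durations are predicted explicitly and deterministically per phone, rather than emerging from a learned encoder-decoder attention map.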