Neural speech synthesis for resource-scarce languages

Author: Louw, Johannes A
Date: 2019-12
Record date: 2020-08-18
Type: Conference Presentation
Language: en
Citation: Louw, J.A. 2019. Neural speech synthesis for resource-scarce languages. In: Proceedings of the South African Forum for Artificial Intelligence, Cape Town, 4-6 December 2019
ISSN: 1613-0073
Proceedings: http://ceur-ws.org/Vol-2540/
Full text: http://ceur-ws.org/Vol-2540/FAIR2019_paper_66.pdf
Handle: http://hdl.handle.net/10204/11541
Repository: ResearchSpace (CSIR), https://researchspace.csir.co.za
Keywords: Deep Neural Network; DNN; Hidden Markov Model; HMM; LPCNet; Resource-scarce languages

Abstract

Recent work in sequence-to-sequence neural networks with attention mechanisms, such as the Tacotron 2 and DCTTS architectures, has brought substantial naturalness improvements in synthesised speech. These architectures require at least an order of magnitude more data than is generally available in resource-scarce language environments. In this paper we propose an efficient feed-forward deep neural network (DNN)-based acoustic model, using stacked bottleneck features, that together with the recently introduced LPCNet vocoder can be used in resource-scarce language environments, with corpora of less than 1 hour in size, to build text-to-speech systems of high perceived naturalness. We compare traditional hidden Markov model (HMM)-based acoustic modelling for speech synthesis with the proposed architecture using the World and LPCNet vocoders, giving both objective and MUSHRA-based subjective results, and show that the DNN and LPCNet combination leads to more natural synthesised speech that can be confused with natural speech. The proposed acoustic model allows an efficient implementation, with faster-than-real-time synthesis.
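The abstract mentions stacked bottleneck features without giving details. As a rough illustration only, and not the paper's implementation, the general idea is usually described as follows: a first network produces a low-dimensional bottleneck vector per frame, the vectors of neighbouring frames are concatenated ("stacked"), and the result is appended to the linguistic input of a second network. The sketch below shows only the stacking step. The function name stack_context, the context width, the edge-padding choice, and all toy values are assumptions made for this example.

```python
def stack_context(frames, context=2):
    """Concatenate each frame with its +/-context neighbours.

    Frames before the start or past the end are replaced by the nearest
    edge frame, so every output row has the same length.
    """
    last = len(frames) - 1
    stacked = []
    for t in range(len(frames)):
        row = []
        for offset in range(-context, context + 1):
            # Clamp the neighbour index to the valid frame range.
            neighbour = min(max(t + offset, 0), last)
            row.extend(frames[neighbour])
        stacked.append(row)
    return stacked


# Hypothetical toy data: 4 frames of 2-dimensional bottleneck activations,
# standing in for the output of a first-stage network.
bottleneck = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]

# Hypothetical 3-dimensional linguistic feature vectors for the same frames.
linguistic = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]

stacked = stack_context(bottleneck, context=1)

# Input rows for a second-stage network: linguistic features plus context.
second_stage_input = [ling + ctx for ling, ctx in zip(linguistic, stacked)]
```

With context=1, each stacked row holds three bottleneck vectors (previous, current, next), so each second-stage input row here has 3 + 6 = 9 values.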