Title: Aligning Audio Samples from the South African Parliament with Hansard Transcriptions
Authors: Kleynhans, N; De Wet, Febe
Type: Conference Presentation
Date issued: 2014-11
Record dates: 2015-03-18; 2015-03-18
Language: en
Citation: Kleynhans, N and De Wet, F. 2014. Aligning Audio Samples from the South African Parliament with Hansard Transcriptions. Proceedings of the 2014 PRASA, RobMech and AfLaT International Joint Symposium, Cape Town, South Africa, 27-28 November 2014, pp 122-127
ISBN: 978-0-620-62617-0
URI: http://hdl.handle.net/10204/7961
Source: Proceedings of the 2014 PRASA, RobMech and AfLaT International Joint Symposium, Cape Town, South Africa, 27-28 November 2014
Keywords: Audio sample alignment; Hansard transcriptions; South African Parliament audio data; National Centre for Human Language Technology

Abstract:
Most of the developing world can still be classified as under-resourced in terms of its languages. Harvesting suitable and relatively easily accessible spoken resources can drastically improve the situation. Parliamentary sessions are one such resource: they are generally publicly available and are most often manually transcribed. In this investigation we present an automatic harvesting procedure which makes use of the “islands of certainty” principle to segment long utterances into shorter, more manageable chunks, and a garbage model to improve alignment by absorbing superfluous speech. The final harvesting approach was used to harvest 50 hours of South African Parliament audio data from a total of 105 hours of raw audio data, at a GOP score of 1.94. The word alignment accuracy, measured on two parliamentary sessions, showed that over 96% of the words are within 1.07 seconds of their true position in the audio stream.

Citation formats:
Kleynhans, N., & De Wet, F. (2014). Aligning Audio Samples from the South African Parliament with Hansard Transcriptions. Pattern Recognition Association of South Africa. http://hdl.handle.net/10204/7961
Kleynhans, N, and Febe De Wet. "Aligning Audio Samples from the South African Parliament with Hansard Transcriptions." (2014): http://hdl.handle.net/10204/7961
Kleynhans N, De Wet F. Aligning Audio Samples from the South African Parliament with Hansard Transcriptions; Pattern Recognition Association of South Africa; 2014. http://hdl.handle.net/10204/7961

RIS record:
TY - Conference Presentation
AU - Kleynhans, N
AU - De Wet, Febe
AB - Most of the developing world can still be classified as under-resourced in terms of its languages. Harvesting suitable and relatively easily accessible spoken resources can drastically improve the situation. Parliamentary sessions are one such resource: they are generally publicly available and are most often manually transcribed. In this investigation we present an automatic harvesting procedure which makes use of the “islands of certainty” principle to segment long utterances into shorter, more manageable chunks, and a garbage model to improve alignment by absorbing superfluous speech. The final harvesting approach was used to harvest 50 hours of South African Parliament audio data from a total of 105 hours of raw audio data, at a GOP score of 1.94. The word alignment accuracy, measured on two parliamentary sessions, showed that over 96% of the words are within 1.07 seconds of their true position in the audio stream.
DA - 2014-11
DB - ResearchSpace
DP - CSIR
KW - Audio sample alignment
KW - Hansard transcriptions
KW - South African Parliament audio data
KW - National Centre for Human Language Technology
LK - https://researchspace.csir.co.za
PY - 2014
SM - 978-0-620-62617-0
T1 - Aligning Audio Samples from the South African Parliament with Hansard Transcriptions
TI - Aligning Audio Samples from the South African Parliament with Hansard Transcriptions
UR - http://hdl.handle.net/10204/7961
ER -