Most of the developing world can still be classified as under-resourced in terms of its languages. Harvesting suitable and relatively easily accessible spoken resources can drastically improve the situation. Parliamentary sessions are one such resource: they are generally publicly available and are most often manually transcribed. In this investigation we present an automatic harvesting procedure which uses the "islands of certainty" principle to segment long utterances into shorter, more manageable chunks, and a garbage model to improve alignment by absorbing superfluous speech. The final harvesting approach was used to harvest 50 hours of South African Parliament audio data from a total of 105 hours of raw audio data, at a GOP score of 1.94. The word alignment accuracy, measured on two parliamentary sessions, showed that over 96% of the words are within 1.07 seconds of the true position in the audio stream.
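The abstract does not spell out how the "islands of certainty" are located, so the following is only a minimal sketch of the general idea, not the authors' procedure. It assumes that a first-pass recogniser output is available as a word list and treats any run of at least `min_len` consecutive words that matches the manual transcript as an island. The function name `find_islands`, the `min_len` parameter, and the example sentences are hypothetical.

```python
from difflib import SequenceMatcher


def find_islands(transcript_words, recognised_words, min_len=3):
    """Return (transcript_index, recognised_index, length) for every run of
    at least `min_len` consecutive words shared by both word sequences.

    Assumption: long exact matches are treated as reliable anchor points
    ("islands") at which a long utterance could be cut into shorter chunks.
    """
    matcher = SequenceMatcher(a=transcript_words, b=recognised_words,
                              autojunk=False)
    return [
        (block.a, block.b, block.size)
        for block in matcher.get_matching_blocks()
        if block.size >= min_len
    ]


# Hypothetical manual transcript and noisy first-pass recogniser output.
transcript = "the honourable member may now address the house on this matter".split()
recognised = "uh the honourable member may address the house um on matter".split()

for t_idx, r_idx, size in find_islands(transcript, recognised):
    print(t_idx, r_idx, " ".join(transcript[t_idx:t_idx + size]))
```

Words that fall between islands, such as the fillers in the hypothetical recogniser output, correspond to the superfluous speech that the abstract says a garbage model is meant to absorb during alignment.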
Reference:
Kleynhans, N. and De Wet, F. 2014. Aligning Audio Samples from the South African Parliament with Hansard Transcriptions. Proceedings of the 2014 PRASA, RobMech and AfLaT International Joint Symposium, Cape Town, South Africa, 27-28 November 2014, pp 122-127. http://hdl.handle.net/10204/7961