Please use this identifier to cite or link to this item: http://hdl.handle.net/10204/5974

Title: Pooling ASR data for closely related languages
Authors: Van Heerden, C; Kleynhans, N; Barnard, E; Davel, M
Keywords: Speech recognition; Data pooling; Under-resourced languages
Issue Date: May-2010
Publisher: School of Computer Sciences, Universiti Sains Malaysia
Citation: Van Heerden, C, Kleynhans, N, Barnard, E and Davel, M. Pooling ASR data for closely related languages. Proceedings of the Workshop on Spoken Languages Technologies for Under-Resourced Languages (SLTU 2010), Penang, Malaysia, May 2010
Abstract: We describe several experiments that were conducted to assess the viability of data pooling as a means to improve speech-recognition performance for under-resourced languages. Two groups of closely related languages from the Southern Bantu language family were studied, and our tests involved phoneme recognition on telephone speech using standard tied-triphone Hidden Markov Models. Approximately 6 to 11 hours of speech from around 170 speakers was available for training in each language. We find that useful improvements in recognition accuracy can be achieved when pooling data from languages that are highly similar, with two hours of data from a closely related language being approximately equivalent to one hour of data from the target language in the best case. However, the benefit decreases rapidly as languages become slightly more distant, and is also expected to decrease when larger corpora are available. Our results suggest that similarities in triphone frequencies are the most accurate predictor of the performance of language pooling in the conditions studied here.
Description: Proceedings of the Workshop on Spoken Languages Technologies for Under-Resourced Languages (SLTU 2010), Penang, Malaysia, May 2010
URI: http://www.mica.edu.vn/sltu-2010/proceedings/Proceedings%20of%20the%202nd%20International%20Workshop%20on%20Spoken%20Languages%20Technologies%20for%20Under-resourced%20Languages.pdf
http://hdl.handle.net/10204/5974
ISBN: 978-967-5417-75-7
Appears in Collections: Human language technologies; General science, engineering & technology

Files in This Item:

File: vanHeerden_2010.pdf
Size: 3.58 MB
Format: Adobe PDF

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.

 
