
Factors that affect the accuracy of text-based language identification

dc.contributor.author Botha, GR
dc.contributor.author Barnard, E
dc.date.accessioned 2008-01-24T14:06:31Z
dc.date.available 2008-01-24T14:06:31Z
dc.date.issued 2007-11
dc.identifier.citation Botha, GR and Barnard, E. 2007. Factors that affect the accuracy of text-based language identification. 18th Annual Symposium of the Pattern Recognition Association of South Africa (PRASA), Pietermaritzburg, KwaZulu-Natal, South Africa, 28-30 November 2007, pp 7 en
dc.identifier.isbn 978-1-86840-656-2
dc.identifier.uri http://hdl.handle.net/10204/1976
dc.description 2007: PRASA en
dc.description.abstract The authors investigate the factors that determine the performance of text-based language identification, with a particular focus on the 11 official languages of South Africa, using n-gram statistics as features for classification. For a fixed value of n, support vector machines generally outperform the other classifiers, but the simpler classifiers are able to handle larger values of n. This is found to be of overriding importance, and a Naïve Bayesian classifier is found to be the best choice of classifier overall. For input strings of 100 characters or more, accuracies as high as 99.4% are achieved. For the smallest input strings studied here, which consist of 15 characters, the best accuracy achieved is only 83%, but when the languages in different families are grouped together, this corresponds to a usable 95.1% accuracy. en
dc.language.iso en en
dc.publisher 18th Annual Symposium of the Pattern Recognition Association of South Africa (PRASA) en
dc.subject Language identification systems en
dc.subject n-gram en
dc.subject Support vector machine en
dc.subject Text-based language identification en
dc.title Factors that affect the accuracy of text-based language identification en
dc.type Conference Presentation en
dc.identifier.apacitation Botha, G., & Barnard, E. (2007). Factors that affect the accuracy of text-based language identification. 18th Annual Symposium of the Pattern Recognition Association of South Africa (PRASA). http://hdl.handle.net/10204/1976 en_ZA
dc.identifier.chicagocitation Botha, GR, and E Barnard. "Factors that affect the accuracy of text-based language identification." (2007): http://hdl.handle.net/10204/1976 en_ZA
dc.identifier.vancouvercitation Botha G, Barnard E, Factors that affect the accuracy of text-based language identification; 18th Annual Symposium of the Pattern Recognition Association of South Africa (PRASA); 2007. http://hdl.handle.net/10204/1976 . en_ZA
dc.identifier.ris TY - Conference Presentation AU - Botha, GR AU - Barnard, E AB - The authors investigate the factors that determine the performance of text-based language identification, with a particular focus on the 11 official languages of South Africa, using n-gram statistics as features for classification. For a fixed value of n, support vector machines generally outperform the other classifiers, but the simpler classifiers are able to handle larger values of n. This is found to be of overriding importance, and a Naïve Bayesian classifier is found to be the best choice of classifier overall. For input strings of 100 characters or more, accuracies as high as 99.4% are achieved. For the smallest input strings studied here, which consist of 15 characters, the best accuracy achieved is only 83%, but when the languages in different families are grouped together, this corresponds to a usable 95.1% accuracy. DA - 2007-11 DB - ResearchSpace DP - CSIR KW - Language identification systems KW - n-gram KW - Support vector machine KW - Text-based language identification LK - https://researchspace.csir.co.za PY - 2007 SM - 978-1-86840-656-2 T1 - Factors that affect the accuracy of text-based language identification TI - Factors that affect the accuracy of text-based language identification UR - http://hdl.handle.net/10204/1976 ER - en_ZA

