Please use this identifier to cite or link to this item: http://hdl.handle.net/10204/1976

Title: Factors that affect the accuracy of text-based language identification
Authors: Botha, GR; Barnard, E
Keywords: Language identification systems; n-gram; Support vector machine; Text-based language identification
Issue Date: Nov-2007
Publisher: 18th Annual Symposium of the Pattern Recognition Association of South Africa (PRASA)
Citation: Botha, GR and Barnard, E. 2007. Factors that affect the accuracy of text-based language identification. 18th Annual Symposium of the Pattern Recognition Association of South Africa (PRASA), Pietermaritzburg, KwaZulu-Natal, South Africa, 28-30 November 2007, pp 7
Abstract: The authors investigate the factors that determine the performance of text-based language identification, with a particular focus on the 11 official languages of South Africa, using n-gram statistics as features for classification. For a fixed value of n, support vector machines generally outperform the other classifiers, but the simpler classifiers are able to handle larger values of n. This is found to be of overriding importance, and a naïve Bayesian classifier is found to be the best choice of classifier overall. For input strings of 100 characters or more, accuracies as high as 99.4% are achieved. For the smallest input strings studied here, which consist of 15 characters, the best accuracy achieved is only 83%, but when the languages in different families are grouped together, this corresponds to a usable 95.1% accuracy.
Description: 2007: PRASA
URI: http://hdl.handle.net/10204/1976
ISBN: 978-1-86840-656-2
Appears in Collections: Human language technologies; General science, engineering & technology

Files in This Item:

Botha2_2007.pdf (127.37 kB, Adobe PDF)
