ResearchSpace

Orthographic measures of language distances between the official South African languages.

dc.contributor.author Zulu, PN
dc.contributor.author Botha, G
dc.contributor.author Barnard, E
dc.date.accessioned 2012-01-31T10:12:31Z
dc.date.available 2012-01-31T10:12:31Z
dc.date.issued 2007
dc.identifier.citation Zulu, PN, Botha, G and Barnard, E. 2007. Orthographic measures of language distances between the official South African languages. CSIR Report (2007) en_US
dc.identifier.uri http://www.docstoc.com/docs/19459727/Orthographic-measures-of-language-distances-between-the-official
dc.identifier.uri http://hdl.handle.net/10204/5547
dc.description Copyright: 2007 CSIR Report en_US
dc.description.abstract Two methods for objectively measuring similarities and dissimilarities between the 11 official languages of South Africa are described. The first concerns the use of n-grams. The confusions between different languages in a text-based language identification system can be used to derive information on the relationships between the languages. Our classifier calculates n-gram statistics from text documents and then uses these statistics as features in classification. We show that the classification results of a validation test can be used as a similarity measure of the relationship between languages. Using the similarity measures, we were able to represent the relationships graphically. We also apply the Levenshtein distance measure to the orthographic word transcriptions from the 11 South African languages under investigation. Hierarchical clustering of the distances between the different languages shows the relationships between the languages in terms of regional groupings and closeness. Both multidimensional scaling and dendrogram analysis reveal results similar to well-known language groupings, and also suggest a finer level of detail on these relationships. en_US
dc.language.iso en en_US
dc.publisher CSIR en_US
dc.subject Language distances en_US
dc.subject Language identification systems en_US
dc.subject Levenshtein distance en_US
dc.subject Clustering en_US
dc.subject n-gram en_US
dc.subject Linguistics en_US
dc.subject Literary studies en_US
dc.subject South African languages en_US
dc.title Orthographic measures of language distances between the official South African languages. en_US
dc.type Report en_US
dc.identifier.apacitation Zulu, P., Botha, G., & Barnard, E. (2007). Orthographic measures of language distances between the official South African languages. CSIR. Retrieved from http://hdl.handle.net/10204/5547 en_ZA
dc.identifier.chicagocitation Zulu, PN, G Botha, and E Barnard. Orthographic measures of language distances between the official South African languages. CSIR, 2007. http://hdl.handle.net/10204/5547 en_ZA
dc.identifier.vancouvercitation Zulu P, Botha G, Barnard E. Orthographic measures of language distances between the official South African languages. 2007 [cited yyyy month dd]. Available from: http://hdl.handle.net/10204/5547 en_ZA
dc.identifier.ris TY - Report AU - Zulu, PN AU - Botha, G AU - Barnard, E AB - Two methods for objectively measuring similarities and dissimilarities between the 11 official languages of South Africa are described. The first concerns the use of n-grams. The confusions between different languages in a text-based language identification system can be used to derive information on the relationships between the languages. Our classifier calculates n-gram statistics from text documents and then uses these statistics as features in classification. We show that the classification results of a validation test can be used as a similarity measure of the relationship between languages. Using the similarity measures, we were able to represent the relationships graphically. We also apply the Levenshtein distance measure to the orthographic word transcriptions from the 11 South African languages under investigation. Hierarchical clustering of the distances between the different languages shows the relationships between the languages in terms of regional groupings and closeness. Both multidimensional scaling and dendrogram analysis reveal results similar to well-known language groupings, and also suggest a finer level of detail on these relationships. DA - 2007 DB - ResearchSpace DP - CSIR KW - Language distances KW - Language identification systems KW - Levenshtein distance KW - Clustering KW - n-gram KW - Linguistics KW - Literary studies KW - South African languages LK - https://researchspace.csir.co.za PY - 2007 T1 - Orthographic measures of language distances between the official South African languages TI - Orthographic measures of language distances between the official South African languages UR - http://hdl.handle.net/10204/5547 ER - en_ZA
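
The abstract mentions applying the Levenshtein distance to orthographic word transcriptions. The following is a minimal sketch of that general idea, not the authors' implementation: the function names are hypothetical, and both the normalization by the longer word's length and the averaging over aligned word pairs are assumptions made here for illustration. The record does not specify how the report aggregates word-level distances.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # Keep the shorter string in the inner loop to limit row size.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def mean_normalized_distance(words_a: Sequence[str],
                             words_b: Sequence[str]) -> float:
    """Hypothetical language-level distance: the mean of word-level
    Levenshtein distances, each divided by the longer word's length.
    Assumes words_a[i] and words_b[i] are aligned pairs (an assumption
    of this sketch, not a detail taken from the report)."""
    pairs = list(zip(words_a, words_b))
    if not pairs:
        raise ValueError("at least one word pair is required")
    total = 0.0
    for wa, wb in pairs:
        longest = max(len(wa), len(wb))
        total += levenshtein(wa, wb) / longest if longest else 0.0
    return total / len(pairs)
```

Computing this value for every pair of languages would give a symmetric distance matrix, which is the kind of input that hierarchical clustering and multidimensional scaling operate on.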
