Please use this identifier to cite or link to this item: http://hdl.handle.net/10204/4030

Title: Setswana tokenisation and computational verb morphology: facing the challenge of a disjunctive orthography
Authors: Pretorius, R
Berg, A
Pretorius, L
Viljoen, B
Keywords: Setswana
Setswana tokenisation
Computational verb morphology
Disjunctive orthography
Verbal prefixal morphemes
Suffixal morphemes
POS tagging
Shallow parsing
Issue Date: Mar-2009
Publisher: Association for Computational Linguistics
Citation: Pretorius, R, Berg, A, Pretorius, L et al 2009. Setswana tokenisation and computational verb morphology: facing the challenge of a disjunctive orthography. EACL Workshop on Language Technologies for African Languages, Athens, Greece, 31 March 2009, pp 66-73
Abstract: Setswana, a Bantu language in the Sotho group, is one of the eleven official languages of South Africa. The language is characterised by a disjunctive orthography, mainly affecting the important word category of verbs. In particular, verbal prefixal morphemes are usually written disjunctively, while suffixal morphemes follow a conjunctive writing style. Therefore, Setswana tokenisation cannot be based solely on whitespace, as it can be in many alphabetic, segmented languages, including the conjunctively written Nguni group of South African Bantu languages. This paper shows how two tokeniser transducers and a finite-state (rule-based) morphological analyser may be combined to effectively solve the Setswana tokenisation problem. The approach has the important advantage of bringing the processing of Setswana beyond the morphological analysis level in line with what is appropriate for the Nguni languages. This means that the challenge of the disjunctive orthography is met at the tokenisation/morphological analysis level and does not in principle propagate to subsequent levels of analysis such as POS tagging and shallow parsing. Indeed, the approach ensures that an aspect such as orthography does not obfuscate sound linguistics and, in turn, proper semantic analysis, which remains the ultimate aim of linguistic analysis and therefore also of computational linguistic analysis.
Description: Copyright: 2009 Association for Computational Linguistics. EACL Workshop on Language Technologies for African Languages, Athens, Greece, 31 March 2009
URI: http://delivery.acm.org/10.1145/1570000/1564522/p66-pretorius.pdf?key1=1564522&key2=5403951721&coll=GUIDE&dl=GUIDE&CFID=86103539&CFTOKEN=75582888
http://hdl.handle.net/10204/4030
ISBN: 1-932432-25-6
Appears in Collections: Human language technologies
General science, engineering & technology

Files in This Item:

File: Prertorius_2009.pdf
Size: 245.04 kB
Format: Adobe PDF


 
