Please use this identifier to cite or link to this item:
http://hdl.handle.net/10204/4030
|
| Title: | Setswana tokenisation and computational verb morphology: facing the challenge of a disjunctive orthography |
| Authors: | Pretorius, R; Berg, A; Pretorius, L; Viljoen, B |
| Keywords: | Setswana; Setswana tokenisation; Computational verb morphology; Disjunctive orthography; Verbal prefixal morphemes; Suffixal morphemes; POS tagging; Shallow parsing |
| Issue Date: | Mar-2009 |
| Publisher: | Association for Computational Linguistics |
| Citation: | Pretorius, R, Berg, A, Pretorius, L et al 2009. Setswana tokenisation and computational verb morphology: facing the challenge of a disjunctive orthography. EACL Workshop on Language Technologies for African Languages, Athens, Greece, 31 March 2009, pp 66-73 |
| Abstract: | Setswana, a Bantu language in the Sotho group, is one of the eleven official languages of South Africa. The language is characterised by a disjunctive orthography, which mainly affects the important word category of verbs. In particular, verbal prefixal morphemes are usually written disjunctively, while suffixal morphemes follow a conjunctive writing style. Setswana tokenisation therefore cannot be based solely on whitespace, as it can in many alphabetic, segmented languages, including the conjunctively written Nguni group of South African Bantu languages. This paper shows how two tokeniser transducers and a finite-state (rule-based) morphological analyser may be combined to solve the Setswana tokenisation problem effectively. The approach has the important advantage of bringing the processing of Setswana beyond the morphological analysis level in line with what is appropriate for the Nguni languages. The challenge of the disjunctive orthography is thus met at the tokenisation/morphological analysis level and does not in principle propagate to subsequent levels of analysis such as POS tagging and shallow parsing. Indeed, the approach ensures that an aspect such as orthography does not obfuscate sound linguistics and, ultimately, proper semantic analysis, which remains the aim of linguistic analysis and therefore also of computational linguistic analysis. |
| Description: | Copyright: 2009 Association for Computational Linguistics. EACL Workshop on Language Technologies for African Languages, Athens, Greece, 31 March 2009 |
| URI: | http://delivery.acm.org/10.1145/1570000/1564522/p66-pretorius.pdf?key1=1564522&key2=5403951721&coll=GUIDE&dl=GUIDE&CFID=86103539&CFTOKEN=75582888 http://hdl.handle.net/10204/4030 |
| ISBN: | 1-932432-25-6 |
| Appears in Collections: | Human language technologies; General science, engineering & technology |
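The abstract's central point, that whitespace alone does not delimit Setswana verb tokens, can be illustrated with a minimal sketch. This is not the authors' finite-state implementation: the function names, the tiny lexicon, and the set-membership check standing in for a morphological analyser are all assumptions made for illustration, and the example phrase is not taken from the paper.

```python
# Toy illustration only: a real system would use finite-state transducers
# and a full morphological analyser rather than these hand-written sets.

# Hypothetical mini-lexicon (assumed for this example, not from the paper).
PREFIXAL_MORPHEMES = {"ke", "a", "go"}   # disjunctively written verbal prefixes
VERB_STEMS = {"rata"}                    # stand-in for the analyser's verb lexicon


def is_verb(word):
    """Stand-in for a morphological analyser accepting a verb form."""
    return word in VERB_STEMS


def tokenise(text):
    """Group a run of prefixal morphemes with the verb that follows it.

    Orthographic words are split on whitespace first; a maximal run of
    prefixal morphemes is merged with the next word only if the (toy)
    analyser accepts that word as a verb. Otherwise the words are
    emitted one by one.
    """
    units = text.lower().split()
    tokens = []
    i = 0
    while i < len(units):
        j = i
        # Extend over consecutive prefixal morphemes.
        while j < len(units) and units[j] in PREFIXAL_MORPHEMES:
            j += 1
        if j > i and j < len(units) and is_verb(units[j]):
            # Prefix run plus verb form becomes one linguistic token.
            tokens.append(" ".join(units[i:j + 1]))
            i = j + 1
        else:
            tokens.append(units[i])
            i += 1
    return tokens


print(tokenise("ke a go rata thata"))
```

Under these assumptions, the four orthographic words of the verb are returned as a single token and the following word stays separate, which is the kind of output a downstream POS tagger could treat in the same way as a conjunctively written verb.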