Title: Setswana tokenisation and computational verb morphology: facing the challenge of a disjunctive orthography
Authors: Pretorius, R; Berg, A; Pretorius, L; Viljoen, B
Date: 2009-03
Record dates: 2010-04-18; 2010-04-18
Type: Conference Presentation
Language: en
Citation: Pretorius, R, Berg, A, Pretorius, L et al 2009. Setswana tokenisation and computational verb morphology: facing the challenge of a disjunctive orthography. EACL Workshop on Language Technologies for African Languages, Athens, Greece, 31 March 2009, pp 66-73
ISBN: 1-932432-25-6
URL: http://delivery.acm.org/10.1145/1570000/1564522/p66-pretorius.pdf?key1=1564522&key2=5403951721&coll=GUIDE&dl=GUIDE&CFID=86103539&CFTOKEN=75582888
Handle: http://hdl.handle.net/10204/4030
Publisher: Association for Computational Linguistics
Copyright: 2009 Association for Computational Linguistics. EACL Workshop on Language Technologies for African Languages, Athens, Greece, 31 March 2009
Repository: ResearchSpace, CSIR, https://researchspace.csir.co.za

Abstract: Setswana, a Bantu language in the Sotho group, is one of the eleven official languages of South Africa. The language is characterised by a disjunctive orthography, mainly affecting the important word category of verbs. In particular, verbal prefixal morphemes are usually written disjunctively, while suffixal morphemes follow a conjunctive writing style. Therefore, Setswana tokenisation cannot be based solely on whitespace, as is the case in many alphabetic, segmented languages, including the conjunctively written Nguni group of South African Bantu languages. This paper shows how two tokeniser transducers and a finite-state (rule-based) morphological analyser may be combined to effectively solve the Setswana tokenisation problem. The approach has the important advantage of bringing the processing of Setswana beyond the morphological analysis level in line with what is appropriate for the Nguni languages. This means that the challenge of the disjunctive orthography is met at the tokenisation/morphological analysis level and does not in principle propagate to subsequent levels of analysis such as POS tagging and shallow parsing. Indeed, the approach ensures that an aspect such as orthography does not obfuscate sound linguistics and, in the end, proper semantic analysis, which remains the ultimate aim of linguistic analysis and therefore also of computational linguistic analysis.

Keywords: Setswana; Setswana tokenisation; Computational verb morphology; Disjunctive orthography; Verbal prefixal morphemes; Suffixal morphemes; POS tagging; Shallow parsing