dc.contributor.author |
Pretorius, R
|
|
dc.contributor.author |
Berg, A
|
|
dc.contributor.author |
Pretorius, L
|
|
dc.contributor.author |
Viljoen, B
|
|
dc.date.accessioned |
2010-04-18T12:53:07Z |
|
dc.date.available |
2010-04-18T12:53:07Z |
|
dc.date.issued |
2009-03 |
|
dc.identifier.citation |
Pretorius, R, Berg, A, Pretorius, L et al 2009. Setswana tokenisation and computational verb morphology: facing the challenge of a disjunctive orthography. EACL Workshop on Language Technologies for African Languages, Athens, Greece, 31 March 2009, pp 66-73 |
en |
dc.identifier.isbn |
1-932432-25-6 |
|
dc.identifier.uri |
http://delivery.acm.org/10.1145/1570000/1564522/p66-pretorius.pdf?key1=1564522&key2=5403951721&coll=GUIDE&dl=GUIDE&CFID=86103539&CFTOKEN=75582888
|
|
dc.identifier.uri |
http://hdl.handle.net/10204/4030
|
|
dc.description |
Copyright: 2009 Association for Computational Linguistics. EACL Workshop on Language Technologies for African Languages, Athens, Greece, 31 March 2009 |
en |
dc.description.abstract |
Setswana, a Bantu language in the Sotho group, is one of the eleven official languages of South Africa. The language is characterised by a disjunctive orthography, mainly affecting the important word category of verbs. In particular, verbal prefixal morphemes are usually written disjunctively, while suffixal morphemes follow a conjunctive writing style. Therefore, Setswana tokenisation cannot be based solely on whitespace, as it can in many alphabetic, segmented languages, including the conjunctively written Nguni group of South African Bantu languages. This paper shows how two tokeniser transducers and a finite-state (rule-based) morphological analyser may be combined to solve the Setswana tokenisation problem effectively. The approach has the important advantage of bringing the processing of Setswana beyond the morphological analysis level in line with what is appropriate for the Nguni languages. This means that the challenge of the disjunctive orthography is met at the tokenisation/morphological analysis level and does not in principle propagate to subsequent levels of analysis such as POS tagging and shallow parsing. Indeed, the approach ensures that an aspect such as orthography does not obfuscate sound linguistics and proper semantic analysis, which remains the ultimate aim of linguistic analysis and therefore also of computational linguistic analysis. |
en |
dc.language.iso |
en |
en |
dc.publisher |
Association for Computational Linguistics |
en |
dc.subject |
Setswana |
en |
dc.subject |
Setswana tokenisation |
en |
dc.subject |
Computational verb morphology |
en |
dc.subject |
Disjunctive orthography |
en |
dc.subject |
Verbal prefixal morphemes |
en |
dc.subject |
Suffixal morphemes |
en |
dc.subject |
POS tagging |
en |
dc.subject |
Shallow parsing |
en |
dc.title |
Setswana tokenisation and computational verb morphology: facing the challenge of a disjunctive orthography |
en |
dc.type |
Conference Presentation |
en |
dc.identifier.apacitation |
Pretorius, R., Berg, A., Pretorius, L., & Viljoen, B. (2009). Setswana tokenisation and computational verb morphology: facing the challenge of a disjunctive orthography. Association for Computational Linguistics. http://hdl.handle.net/10204/4030 |
en_ZA |
dc.identifier.chicagocitation |
Pretorius, R, A Berg, L Pretorius, and B Viljoen. "Setswana tokenisation and computational verb morphology: facing the challenge of a disjunctive orthography." (2009): http://hdl.handle.net/10204/4030 |
en_ZA |
dc.identifier.vancouvercitation |
Pretorius R, Berg A, Pretorius L, Viljoen B. Setswana tokenisation and computational verb morphology: facing the challenge of a disjunctive orthography. Association for Computational Linguistics; 2009. http://hdl.handle.net/10204/4030 |
en_ZA |
dc.identifier.ris |
TY - Conference Presentation
AU - Pretorius, R
AU - Berg, A
AU - Pretorius, L
AU - Viljoen, B
AB - Setswana, a Bantu language in the Sotho group, is one of the eleven official languages of South Africa. The language is characterised by a disjunctive orthography, mainly affecting the important word category of verbs. In particular, verbal prefixal morphemes are usually written disjunctively, while suffixal morphemes follow a conjunctive writing style. Therefore, Setswana tokenisation cannot be based solely on whitespace, as it can in many alphabetic, segmented languages, including the conjunctively written Nguni group of South African Bantu languages. This paper shows how two tokeniser transducers and a finite-state (rule-based) morphological analyser may be combined to solve the Setswana tokenisation problem effectively. The approach has the important advantage of bringing the processing of Setswana beyond the morphological analysis level in line with what is appropriate for the Nguni languages. This means that the challenge of the disjunctive orthography is met at the tokenisation/morphological analysis level and does not in principle propagate to subsequent levels of analysis such as POS tagging and shallow parsing. Indeed, the approach ensures that an aspect such as orthography does not obfuscate sound linguistics and proper semantic analysis, which remains the ultimate aim of linguistic analysis and therefore also of computational linguistic analysis.
DA - 2009-03
DB - ResearchSpace
DP - CSIR
KW - Setswana
KW - Setswana tokenisation
KW - Computational verb morphology
KW - Disjunctive orthography
KW - Verbal prefixal morphemes
KW - Suffixal morphemes
KW - POS tagging
KW - Shallow parsing
LK - https://researchspace.csir.co.za
PY - 2009
SN - 1-932432-25-6
T1 - Setswana tokenisation and computational verb morphology: facing the challenge of a disjunctive orthography
TI - Setswana tokenisation and computational verb morphology: facing the challenge of a disjunctive orthography
UR - http://hdl.handle.net/10204/4030
ER -
|
en_ZA |