Prototype-based active learning for lemmatization

Authors: Daelemans, W; Groenewald, HJ; Van Huyssteen, GB
Record dates: 2009-10-12; 2009-10-12
Issued: 2009-09
Citation: Daelemans, W, Groenewald, HJ and Van Huyssteen, GB. 2009. Prototype-based active learning for lemmatization. Proceedings of the RANLP'2009 International Conference Recent Advances in Natural Language Processing. Borovets, Bulgaria, 14-16 September, 2009. pp 65-70
ISSN: 1313-8502
URI: http://hdl.handle.net/10204/3646
Note: This is the author's version of the work. The definitive version was published in the Proceedings of the RANLP'2009 International Conference Recent Advances in Natural Language Processing. Borovets, Bulgaria, 14-16 September, 2009. pp 65-70

Abstract:
Annotation of training data for machine learning is often a laborious and costly process. In Active Learning (AL), criteria are investigated that allow the unannotated data to be ordered so that the instances likely to contribute most to the speed of learning are annotated first. Within this context, the researchers explore a new approach that uses prototypicality as a criterion for selecting the instances that act as training data, in order to optimize prediction accuracy. In parallel with the prototype-based active classification (PBAC) approach of Cebron & Berthold (2009), they investigate whether the basic PBAC assumption holds for linguistic data. The NLP task addressed is lemmatization, the reduction of inflected word forms to their base form. The paper operationalizes prototypicality as features (i.e. word frequency and word length) of the already available training data items, and combines this with a measure of uncertainty (entropy). It shows that selecting less prototypical instances first yields better performance than selecting data at random or using state-of-the-art AL methods.
The researchers argue that this improvement is possible because language processing tasks have highly disjunctive instance spaces, as there are often few regularities and many irregularities.

Language: en
Keywords: Active learning; Prototype-based active classification; PBAC; Lemmatization; Prototype theory; Afrikaans lemmatization; Natural language processing; Recent Advances in Natural Language Processing Conference 2009; RANLP 2009; Prototypicality; Text technology; Linguistic data
Type: Conference Presentation
Repository: ResearchSpace, CSIR (https://researchspace.csir.co.za)
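
The selection criterion summarized in the abstract (prototypicality from word frequency and word length, combined with entropy, with less prototypical instances annotated first) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration, not the authors' method: the abstract does not say how frequency, length and entropy are weighted or combined, so the `prototypicality` formula, the `alpha` weight, and the function names are hypothetical, and `predict_proba` stands in for whatever classifier supplies class probabilities.

```python
import math
from collections import Counter


def entropy(probs):
    """Shannon entropy (in bits) of a predicted class distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def prototypicality(word, freq, mean_len):
    """Hypothetical score: frequent words of typical length are more prototypical.

    This particular formula is an assumption; the abstract only names
    word frequency and word length as the features involved.
    """
    freq_part = math.log1p(freq.get(word, 0))   # higher for frequent words
    len_part = -abs(len(word) - mean_len)       # higher for typical lengths
    return freq_part + len_part


def rank_for_annotation(pool, predict_proba, training_words, alpha=1.0):
    """Order unannotated words so uncertain, less prototypical ones come first.

    predict_proba(word) must return a list of class probabilities.
    alpha (assumed) weights prototypicality against entropy.
    """
    freq = Counter(training_words)
    mean_len = sum(len(w) for w in training_words) / len(training_words)

    def score(word):
        uncertainty = entropy(predict_proba(word))
        return uncertainty - alpha * prototypicality(word, freq, mean_len)

    return sorted(pool, key=score, reverse=True)


if __name__ == "__main__":
    training = ["kat", "kat", "kat", "hond", "boek"]
    pool = ["kat", "hond", "onreelmatighede"]
    # A dummy classifier that is equally uncertain about every word,
    # so the ordering is driven by prototypicality alone.
    print(rank_for_annotation(pool, lambda w: [0.5, 0.5], training))
```

With equal uncertainty for every word, the long, unseen word is ranked first and the frequent, short word last, which mirrors the "less prototypical first" ordering the abstract reports as beneficial.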