Please use this identifier to cite or link to this item: http://hdl.handle.net/10204/5587

Title: Two approaches to gathering text corpora from the WorldWideWeb
Authors: Botha, G; Barnard, E
Keywords: Text corpora; Text collection; Web-crawling
Issue Date: Nov-2005
Publisher: PRASA
Citation: Botha, G and Barnard, E. Two approaches to gathering text corpora from the WorldWideWeb. Sixteenth Annual Symposium of the Pattern Recognition Association of South Africa, Langebaan, South Africa, 23-25 November 2005
Abstract: Many applications of pattern recognition to natural language processing require large text corpora in a specified language. For many of the languages of the world, such corpora are not readily available, but significant quantities of text are available on the World Wide Web. We describe and compare two approaches to gathering language-specific corpora from this resource, and show that the use of a commercial search engine as a first stage leads to good results.
Description: Sixteenth Annual Symposium of the Pattern Recognition Association of South Africa, Langebaan, South Africa, 23-25 November 2005
URI: http://hdl.handle.net/10204/5587
ISBN: 0-7992-2264-X
Appears in Collections: Human language technologies; General science, engineering & technology

Files in This Item:

Barnard_2005.pdf (35.81 kB, Adobe PDF)

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.

 
