Please use this identifier to cite or link to this item:
http://hdl.handle.net/10204/5587
| Title: | Two approaches to gathering text corpora from the World Wide Web |
| Authors: | Botha, G; Barnard, E |
| Keywords: | Text corpora; Text collection; Web-crawling |
| Issue Date: | Nov-2005 |
| Publisher: | PRASA |
| Citation: | Botha, G and Barnard, E. Two approaches to gathering text corpora from the World Wide Web. Sixteenth Annual Symposium of the Pattern Recognition Association of South Africa, Langebaan, South Africa, 23-25 November 2005 |
| Abstract: | Many applications of pattern recognition to natural language processing require large text corpora in a specified language. For many of the languages of the world, such corpora are not readily available, but significant quantities of text are available on the World Wide Web. We describe and compare two approaches to gathering language-specific corpora from this resource, and show that the use of a commercial search engine as a first stage leads to good results. |
| Description: | Sixteenth Annual Symposium of the Pattern Recognition Association of South Africa, Langebaan, South Africa, 23-25 November 2005 |
| URI: | http://hdl.handle.net/10204/5587 |
| ISBN: | 0-7992-2264-X |
| Appears in Collections: | Human language technologies; General science, engineering & technology |