Two approaches to gathering text corpora from the WorldWideWeb

Authors: Botha, G; Barnard, E
Date issued: 2005-11
Repository record date: 2012-02-23
Type: Conference Presentation
Language: en
Publisher: PRASA
Citation: Botha, G and Barnard, E. Two approaches to gathering text corpora from the WorldWideWeb. Sixteenth Annual Symposium of the Pattern Recognition Association of South Africa, Langebaan, South Africa, 23-25 November 2005
ISBN: 0-7992-2264-X
URI: http://hdl.handle.net/10204/5587
Repository: ResearchSpace, CSIR (https://researchspace.csir.co.za)
Keywords: Text corpora; Text collection; Web-crawling

Abstract

Many applications of pattern recognition to natural language processing require large text corpora in a specified language. For many of the languages of the world, such corpora are not readily available, but significant quantities of text are available on the World Wide Web. We describe and compare two approaches to gathering language-specific corpora from this resource, and show that the use of a commercial search engine as a first stage leads to good results.