ResearchSpace

Two approaches to gathering text corpora from the WorldWideWeb

dc.contributor.author Botha, G
dc.contributor.author Barnard, E
dc.date.accessioned 2012-02-23T07:34:46Z
dc.date.available 2012-02-23T07:34:46Z
dc.date.issued 2005-11
dc.identifier.citation Botha, G and Barnard, E. Two approaches to gathering text corpora from the WorldWideWeb. Sixteenth Annual Symposium of the Pattern Recognition Association of South Africa, Langebaan, South Africa, 23-25 November 2005 en_US
dc.identifier.isbn 0-7992-2264-X
dc.identifier.uri http://hdl.handle.net/10204/5587
dc.description Sixteenth Annual Symposium of the Pattern Recognition Association of South Africa, Langebaan, South Africa, 23-25 November 2005 en_US
dc.description.abstract Many applications of pattern recognition to natural language processing require large text corpora in a specified language. For many of the languages of the world, such corpora are not readily available, but significant quantities of text are available on the World Wide Web. We describe and compare two approaches to gathering language-specific corpora from this resource, and show that the use of a commercial search engine as a first stage leads to good results. en_US
dc.language.iso en en_US
dc.publisher PRASA en_US
dc.subject Text corpora en_US
dc.subject Text collection en_US
dc.subject Web-crawling en_US
dc.title Two approaches to gathering text corpora from the WorldWideWeb en_US
dc.type Conference Presentation en_US
dc.identifier.apacitation Botha, G., & Barnard, E. (2005). Two approaches to gathering text corpora from the WorldWideWeb. PRASA. http://hdl.handle.net/10204/5587 en_ZA
dc.identifier.chicagocitation Botha, G, and E Barnard. "Two approaches to gathering text corpora from the WorldWideWeb." (2005): http://hdl.handle.net/10204/5587 en_ZA
dc.identifier.vancouvercitation Botha G, Barnard E, Two approaches to gathering text corpora from the WorldWideWeb; PRASA; 2005. http://hdl.handle.net/10204/5587 . en_ZA
dc.identifier.ris
TY - Conference Presentation
AU - Botha, G
AU - Barnard, E
AB - Many applications of pattern recognition to natural language processing require large text corpora in a specified language. For many of the languages of the world, such corpora are not readily available, but significant quantities of text are available on the World Wide Web. We describe and compare two approaches to gathering language-specific corpora from this resource, and show that the use of a commercial search engine as a first stage leads to good results.
DA - 2005-11
DB - ResearchSpace
DP - CSIR
KW - Text corpora
KW - Text collection
KW - Web-crawling
LK - https://researchspace.csir.co.za
PY - 2005
SM - 0-7992-2264-X
T1 - Two approaches to gathering text corpora from the WorldWideWeb
TI - Two approaches to gathering text corpora from the WorldWideWeb
UR - http://hdl.handle.net/10204/5587
ER -
en_ZA