Two approaches to gathering text corpora from the WorldWideWeb

Botha, G; Barnard, E

dc.contributor.author	Botha, G
dc.contributor.author	Barnard, E
dc.date.accessioned	2012-02-23T07:34:46Z
dc.date.available	2012-02-23T07:34:46Z
dc.date.issued	2005-11
dc.identifier.citation	Botha, G and Barnard, E. Two approaches to gathering text corpora from the WorldWideWeb. Sixteenth Annual Symposium of the Pattern Recognition Association of South Africa, Langebaan, South Africa, 23-25 November 2005	en_US
dc.identifier.isbn	0-7992-2264-X
dc.identifier.uri	http://hdl.handle.net/10204/5587
dc.description	Sixteenth Annual Symposium of the Pattern Recognition Association of South Africa, Langebaan, South Africa, 23-25 November 2005	en_US
dc.description.abstract	Many applications of pattern recognition to natural language processing require large text corpora in a specified language. For many of the languages of the world, such corpora are not readily available, but significant quantities of text are available on the World Wide Web. We describe and compare two approaches to gathering language-specific corpora from this resource, and show that the use of a commercial search engine as a first stage leads to good results.	en_US
dc.language.iso	en	en_US
dc.publisher	PRASA	en_US
dc.subject	Text corpora	en_US
dc.subject	Text collection	en_US
dc.subject	Web-crawling	en_US
dc.title	Two approaches to gathering text corpora from the WorldWideWeb	en_US
dc.type	Conference Presentation	en_US
dc.identifier.apacitation	Botha, G., & Barnard, E. (2005). Two approaches to gathering text corpora from the WorldWideWeb. PRASA. http://hdl.handle.net/10204/5587	en_ZA
dc.identifier.chicagocitation	Botha, G, and E Barnard. "Two approaches to gathering text corpora from the WorldWideWeb." (2005): http://hdl.handle.net/10204/5587	en_ZA
dc.identifier.vancouvercitation	Botha G, Barnard E, Two approaches to gathering text corpora from the WorldWideWeb; PRASA; 2005. http://hdl.handle.net/10204/5587 .	en_ZA
dc.identifier.ris	TY - Conference Presentation AU - Botha, G AU - Barnard, E AB - Many applications of pattern recognition to natural language processing require large text corpora in a specified language. For many of the languages of the world, such corpora are not readily available, but significant quantities of text are available on the World Wide Web. We describe and compare two approaches to gathering language-specific corpora from this resource, and show that the use of a commercial search engine as a first stage leads to good results. DA - 2005-11 DB - ResearchSpace DP - CSIR KW - Text corpora KW - Text collection KW - Web-crawling LK - https://researchspace.csir.co.za PY - 2005 SM - 0-7992-2264-X T1 - Two approaches to gathering text corpora from the WorldWideWeb TI - Two approaches to gathering text corpora from the WorldWideWeb UR - http://hdl.handle.net/10204/5587 ER -	en_ZA

Files in this item

Name: Barnard_2005.pdf

Size: 35.81Kb

Format: PDF

This item appears in the following Collection(s)

Conference Publications
