Title: A toolkit for text extraction and analysis for natural language processing tasks

Authors: Sefara, Tshephisho J; Mbooi, Mahlatse S; Mashile, Katlego J; Rambuda, Thompho; Rangata, Mapitsi R

Type: Conference Presentation
Conference: 2022 International Conference on Artificial Intelligence, Big Data, Computing and Data Communication Systems (icABCD), Durban, South Africa, 4-5 August 2022
Date issued: 2022-08
Repository record date: 2022-12-11
Language: en
Fulltext: available

Citation: Sefara, T.J., Mbooi, M.S., Mashile, K.J., Rambuda, T. & Rangata, M.R. 2022. A toolkit for text extraction and analysis for natural language processing tasks. http://hdl.handle.net/10204/12565

ISBN: 978-1-6654-8422-0; 978-1-6654-8421-3; 978-1-6654-8423-7
DOI: 10.1109/icABCD54961.2022.9856269
URI: http://hdl.handle.net/10204/12565
Repository: ResearchSpace (CSIR), https://researchspace.csir.co.za
Other identifier: 26284

Abstract: Text extraction is an important part of natural language processing (NLP) tasks. Most NLP tasks, such as text classification, machine translation, text-to-speech, text-based language identification, text summarization, and named-entity recognition, involve the use of textual data. Such data is limited for low-resourced languages, making it difficult to experiment with advanced NLP techniques on these languages. This paper presents a Python-based toolkit for text analysis and text extraction from different types of images, documents, and audio files. The toolkit is built as a library whose functions can be imported and used for text extraction.

Keywords: Text recognition; Text categorization; Big data; Natural Language Processing; Machine translation; Data communication