Unsupervised topic modelling on South African parliament audio data

Kleynhans, N

http://hdl.handle.net/10204/7947

Abstract:

Using a speech recognition system to convert spoken audio to text can enable the structuring of large collections of spoken audio data. A convenient means to summarise or cluster spoken data is to identify the topic under discussion. Many text-based topic modelling and identification techniques become available once the audio-to-text conversion has occurred. These approaches allow spoken audio data to be managed and presented in a more structured way. In this work, an accurate spoken topic identification system was developed to identify the dominant topic discussed in a South African parliamentary session. This was achieved by using CMU Sphinx word recognisers to convert the conversations to word representations and then applying latent Dirichlet allocation (LDA) topic modelling techniques to those representations. The best topic identification accuracy of 92.3% was obtained on 40 topics, derived from speech recogniser transcriptions and compared to the Hansard transcriptions of National Assembly sessions of the South African Parliament.
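The evaluation described in the abstract can be illustrated with a minimal sketch. It assumes one plausible reading of the setup: each session has a per-topic weight vector inferred from the speech recogniser transcription and another inferred from the Hansard transcription, and a session counts as correctly identified when both vectors have the same dominant topic. The function names, the toy weights, and the three-topic setup are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def dominant_topic(topic_weights: Sequence[float]) -> int:
    """Return the index of the highest-weighted topic for one session."""
    if not topic_weights:
        raise ValueError("topic_weights must not be empty")
    return max(range(len(topic_weights)), key=lambda k: topic_weights[k])


def identification_accuracy(
    asr_sessions: Sequence[Sequence[float]],
    hansard_sessions: Sequence[Sequence[float]],
) -> float:
    """Fraction of sessions whose dominant topic matches in both sources.

    Assumption: the two sequences are aligned, so entry i in each refers
    to the same parliamentary session.
    """
    if len(asr_sessions) != len(hansard_sessions):
        raise ValueError("both inputs must cover the same sessions")
    if not asr_sessions:
        raise ValueError("at least one session is required")
    matches = sum(
        dominant_topic(asr) == dominant_topic(ref)
        for asr, ref in zip(asr_sessions, hansard_sessions)
    )
    return matches / len(asr_sessions)


# Toy per-session topic weights (hypothetical values, three topics).
asr = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.2, 0.3, 0.5], [0.5, 0.4, 0.1]]
hansard = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7], [0.3, 0.6, 0.1]]

accuracy = identification_accuracy(asr, hansard)
print(f"Topic identification accuracy: {accuracy:.1%}")
```

In this toy example three of the four sessions agree on the dominant topic. In practice the weight vectors would come from an LDA model's per-document topic distribution rather than being written by hand.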

Reference:

Kleynhans, N. 2014. Unsupervised topic modelling on South African parliament audio data. Proceedings of the 2014 PRASA, RobMech and AfLaT International Joint Symposium, Cape Town, South Africa, 27-28 November 2014

Kleynhans, N. (2014). Unsupervised topic modelling on South African parliament audio data. Pattern Recognition Association of South Africa. http://hdl.handle.net/10204/7947

Kleynhans, N. "Unsupervised topic modelling on South African parliament audio data." (2014): http://hdl.handle.net/10204/7947

Kleynhans N, Unsupervised topic modelling on South African parliament audio data; Pattern Recognition Association of South Africa; 2014. http://hdl.handle.net/10204/7947

Copyright: Proceedings of the 2014 PRASA, RobMech and AfLaT International Joint Symposium, Cape Town, South Africa, 27-28 November 2014.

Author: Kleynhans, N

Date: Nov 2014

Subjects:

Speech recognition systems
Spoken audio data
South African parliament audio data
CMU Sphinx word recognisers
Hansard transcriptions

Files in this item

Kleynhans3_2014.pdf

This item appears in the following Collection(s)

Conference Publications