Rethinking k-means clustering in the age of massive datasets: A constant-time approach

Olukanmi, P; Nelwamondo, Fulufhelo V; Marwala, T

dc.contributor.author	Olukanmi, P
dc.contributor.author	Nelwamondo, Fulufhelo V
dc.contributor.author	Marwala, T
dc.date.accessioned	2020-04-12T16:41:31Z
dc.date.available	2020-04-12T16:41:31Z
dc.date.issued	2019-12
dc.identifier.citation	Olukanmi, P., Nelwamondo, F.V. and Marwala, T. 2019. Rethinking k-means clustering in the age of massive datasets: A constant-time approach. Neural Computing and Applications: https://doi.org/10.1007/s00521-019-04673-0	en_US
dc.identifier.issn	0941-0643
dc.identifier.issn	1433-3058
dc.identifier.uri	https://doi.org/10.1007/s00521-019-04673-0
dc.identifier.uri	https://link.springer.com/article/10.1007/s00521-019-04673-0#citeas
dc.identifier.uri	https://rdcu.be/b3w47
dc.identifier.uri	http://hdl.handle.net/10204/11411
dc.description	Copyright: 2019 Springer. Due to copyright restrictions, the attached PDF file only contains the abstract of the full text item. For access to the full text item, please consult the publisher's website: https://doi.org/10.1007/s00521-019-04673-0. A free fulltext non-print version of the article can be viewed at https://rdcu.be/b3w47	en_US
dc.description.abstract	We introduce a highly efficient k-means clustering approach. We show that the classical central limit theorem addresses a special case (k=1) of the k-means problem and then extend it to the general case. Instead of using the full dataset, our algorithm named k-means-lite applies the standard k-means to the combination C (size nk) of all sample centroids obtained from n independent small samples. Unlike ordinary uniform sampling, the approach asymptotically preserves the performance of the original algorithm. In our experiments with a wide range of synthetic and real-world datasets, k-means-lite matches the performance of k-means when C is constructed using 30 samples of size 40+2k. Although the 30-sample choice proves to be a generally reliable rule, when the proposed approach is used to scale k-means++ (we call this scaled version k-means-lite++), k-means++’ performance is matched in several cases, using only five samples. These two new algorithms are presented to demonstrate the proposed approach, but the approach can be applied to create a constant-time version of any other k-means clustering algorithm, since it does not modify the internal workings of the base algorithm.	en_US
dc.language.iso	en	en_US
dc.publisher	Springer	en_US
dc.relation.ispartofseries	Worklist;23101
dc.subject	k-means	en_US
dc.subject	Clustering	en_US
dc.subject	Scalable	en_US
dc.subject	Large datasets	en_US
dc.title	Rethinking k-means clustering in the age of massive datasets: A constant-time approach	en_US
dc.type	Article	en_US
dc.identifier.apacitation	Olukanmi, P., Nelwamondo, F. V., & Marwala, T. (2019). Rethinking k-means clustering in the age of massive datasets: A constant-time approach. http://hdl.handle.net/10204/11411	en_ZA
dc.identifier.chicagocitation	Olukanmi, P, Fulufhelo V Nelwamondo, and T Marwala "Rethinking k-means clustering in the age of massive datasets: A constant-time approach." (2019) http://hdl.handle.net/10204/11411	en_ZA
dc.identifier.vancouvercitation	Olukanmi P, Nelwamondo FV, Marwala T. Rethinking k-means clustering in the age of massive datasets: A constant-time approach. 2019; http://hdl.handle.net/10204/11411.	en_ZA
dc.identifier.ris	TY - Article AU - Olukanmi, P AU - Nelwamondo, Fulufhelo V AU - Marwala, T AB - We introduce a highly efficient k-means clustering approach. We show that the classical central limit theorem addresses a special case (k=1) of the k-means problem and then extend it to the general case. Instead of using the full dataset, our algorithm named k-means-lite applies the standard k-means to the combination C (size nk) of all sample centroids obtained from n independent small samples. Unlike ordinary uniform sampling, the approach asymptotically preserves the performance of the original algorithm. In our experiments with a wide range of synthetic and real-world datasets, k-means-lite matches the performance of k-means when C is constructed using 30 samples of size 40+2k. Although the 30-sample choice proves to be a generally reliable rule, when the proposed approach is used to scale k-means++ (we call this scaled version k-means-lite++), k-means++’ performance is matched in several cases, using only five samples. These two new algorithms are presented to demonstrate the proposed approach, but the approach can be applied to create a constant-time version of any other k-means clustering algorithm, since it does not modify the internal workings of the base algorithm. DA - 2019-12 DB - ResearchSpace DP - CSIR KW - k-means KW - Clustering KW - Scalable KW - Large datasets LK - https://researchspace.csir.co.za PY - 2019 SM - 0941-0643 SM - 1433-3058 T1 - Rethinking k-means clustering in the age of massive datasets: A constant-time approach TI - Rethinking k-means clustering in the age of massive datasets: A constant-time approach UR - http://hdl.handle.net/10204/11411 ER -	en_ZA

Files in this item

Name: Olukanmi_23101_20 ...

Size: 509.9Kb

Format: PDF

Description: Abstract

View/Open

This item appears in the following Collection(s)

Journal Articles

Show simple item record

Browse

All of ResearchSpace
This Collection
- By Issue Date
- Authors
- Titles
- Subjects
- Publication Type
- Cluster
- Impact Area

Quick Links

Legislation and compliance

General Enquiries

Tel: + 27 12 841 2911
Email: callcentre@csir.co.za

Physical Address
Meiring Naudé Road
Brummeria
Pretoria
South Africa

Postal Address
PO Box 395
Pretoria 0001
South Africa

Social Connect

Resources on this site are free to download and reuse according to associated licensing provision. Please read the terms and conditions of usage of each resource.