ResearchSpace

Rethinking k-means clustering in the age of massive datasets: A constant-time approach

Show simple item record

dc.contributor.author Olukanmi, P
dc.contributor.author Nelwamondo, Fulufhelo V
dc.contributor.author Marwala, T
dc.date.accessioned 2020-04-12T16:41:31Z
dc.date.available 2020-04-12T16:41:31Z
dc.date.issued 2019-12
dc.identifier.citation Olukanmi, P., Nelwamondo, F.V. and Marwala, T. 2019. Rethinking k-means clustering in the age of massive datasets: A constant-time approach. Neural Computing and Applications: https://doi.org/10.1007/s00521-019-04673-0 en_US
dc.identifier.issn 0941-0643
dc.identifier.issn 1433-3058
dc.identifier.uri https://doi.org/10.1007/s00521-019-04673-0
dc.identifier.uri https://link.springer.com/article/10.1007/s00521-019-04673-0#citeas
dc.identifier.uri https://rdcu.be/b3w47
dc.identifier.uri http://hdl.handle.net/10204/11411
dc.description Copyright: 2019 Springer. Due to copyright restrictions, the attached PDF file only contains the abstract of the full text item. For access to the full text item, please consult the publisher's website: https://doi.org/10.1007/s00521-019-04673-0. A free fulltext non-print version of the article can be viewed at https://rdcu.be/b3w47 en_US
dc.description.abstract We introduce a highly efficient k-means clustering approach. We show that the classical central limit theorem addresses a special case (k=1) of the k-means problem and then extend it to the general case. Instead of using the full dataset, our algorithm named k-means-lite applies the standard k-means to the combination C (size nk) of all sample centroids obtained from n independent small samples. Unlike ordinary uniform sampling, the approach asymptotically preserves the performance of the original algorithm. In our experiments with a wide range of synthetic and real-world datasets, k-means-lite matches the performance of k-means when C is constructed using 30 samples of size 40+2k. Although the 30-sample choice proves to be a generally reliable rule, when the proposed approach is used to scale k-means++ (we call this scaled version k-means-lite++), k-means++’ performance is matched in several cases, using only five samples. These two new algorithms are presented to demonstrate the proposed approach, but the approach can be applied to create a constant-time version of any other k-means clustering algorithm, since it does not modify the internal workings of the base algorithm. en_US
dc.language.iso en en_US
dc.publisher Springer en_US
dc.relation.ispartofseries Worklist;23101
dc.subject k-means en_US
dc.subject Clustering en_US
dc.subject Scalable en_US
dc.subject Large datasets en_US
dc.title Rethinking k-means clustering in the age of massive datasets: A constant-time approach en_US
dc.type Article en_US
dc.identifier.apacitation Olukanmi, P., Nelwamondo, F. V., & Marwala, T. (2019). Rethinking k-means clustering in the age of massive datasets: A constant-time approach. http://hdl.handle.net/10204/11411 en_ZA
dc.identifier.chicagocitation Olukanmi, P, Fulufhelo V Nelwamondo, and T Marwala "Rethinking k-means clustering in the age of massive datasets: A constant-time approach." (2019) http://hdl.handle.net/10204/11411 en_ZA
dc.identifier.vancouvercitation Olukanmi P, Nelwamondo FV, Marwala T. Rethinking k-means clustering in the age of massive datasets: A constant-time approach. 2019; http://hdl.handle.net/10204/11411. en_ZA
dc.identifier.ris TY - Article AU - Olukanmi, P AU - Nelwamondo, Fulufhelo V AU - Marwala, T AB - We introduce a highly efficient k-means clustering approach. We show that the classical central limit theorem addresses a special case (k=1) of the k-means problem and then extend it to the general case. Instead of using the full dataset, our algorithm named k-means-lite applies the standard k-means to the combination C (size nk) of all sample centroids obtained from n independent small samples. Unlike ordinary uniform sampling, the approach asymptotically preserves the performance of the original algorithm. In our experiments with a wide range of synthetic and real-world datasets, k-means-lite matches the performance of k-means when C is constructed using 30 samples of size 40+2k. Although the 30-sample choice proves to be a generally reliable rule, when the proposed approach is used to scale k-means++ (we call this scaled version k-means-lite++), k-means++’ performance is matched in several cases, using only five samples. These two new algorithms are presented to demonstrate the proposed approach, but the approach can be applied to create a constant-time version of any other k-means clustering algorithm, since it does not modify the internal workings of the base algorithm. DA - 2019-12 DB - ResearchSpace DP - CSIR KW - k-means KW - Clustering KW - Scalable KW - Large datasets LK - https://researchspace.csir.co.za PY - 2019 SM - 0941-0643 SM - 1433-3058 T1 - Rethinking k-means clustering in the age of massive datasets: A constant-time approach TI - Rethinking k-means clustering in the age of massive datasets: A constant-time approach UR - http://hdl.handle.net/10204/11411 ER - en_ZA


Files in this item

This item appears in the following Collection(s)

Show simple item record