Rethinking k-means clustering in the age of massive datasets: A constant-time approach

Olukanmi, PNelwamondo, Fulufhelo VMarwala, T2020-04-122020-04-122019-12Olukanmi, P., Nelwamondo, F.V. and Marwala, T. 2019. Rethinking k-means clustering in the age of massive datasets: A constant-time approach. Neural Computing and Applications: https://doi.org/10.1007/s00521-019-04673-00941-06431433-3058https://doi.org/10.1007/s00521-019-04673-0https://link.springer.com/article/10.1007/s00521-019-04673-0#citeashttps://rdcu.be/b3w47http://hdl.handle.net/10204/11411Copyright: 2019 Springer. Due to copyright restrictions, the attached PDF file only contains the abstract of the full text item. For access to the full text item, please consult the publisher's website: https://doi.org/10.1007/s00521-019-04673-0. A free fulltext non-print version of the article can be viewed at https://rdcu.be/b3w47We introduce a highly efficient k-means clustering approach. We show that the classical central limit theorem addresses a special case (k=1) of the k-means problem and then extend it to the general case. Instead of using the full dataset, our algorithm named k-means-lite applies the standard k-means to the combination C (size nk) of all sample centroids obtained from n independent small samples. Unlike ordinary uniform sampling, the approach asymptotically preserves the performance of the original algorithm. In our experiments with a wide range of synthetic and real-world datasets, k-means-lite matches the performance of k-means when C is constructed using 30 samples of size 40+2k. Although the 30-sample choice proves to be a generally reliable rule, when the proposed approach is used to scale k-means++ (we call this scaled version k-means-lite++), k-means++’ performance is matched in several cases, using only five samples. These two new algorithms are presented to demonstrate the proposed approach, but the approach can be applied to create a constant-time version of any other k-means clustering algorithm, since it does not modify the internal workings of the base algorithm.enk-meansClusteringScalableLarge datasetsRethinking k-means clustering in the age of massive datasets: A constant-time approachArticleOlukanmi, P., Nelwamondo, F. V., & Marwala, T. (2019). Rethinking k-means clustering in the age of massive datasets: A constant-time approach. http://hdl.handle.net/10204/11411Olukanmi, P, Fulufhelo V Nelwamondo, and T Marwala "Rethinking k-means clustering in the age of massive datasets: A constant-time approach." (2019) http://hdl.handle.net/10204/11411Olukanmi P, Nelwamondo FV, Marwala T. Rethinking k-means clustering in the age of massive datasets: A constant-time approach. 2019; http://hdl.handle.net/10204/11411.TY - Article AU - Olukanmi, P AU - Nelwamondo, Fulufhelo V AU - Marwala, T AB - We introduce a highly efficient k-means clustering approach. We show that the classical central limit theorem addresses a special case (k=1) of the k-means problem and then extend it to the general case. Instead of using the full dataset, our algorithm named k-means-lite applies the standard k-means to the combination C (size nk) of all sample centroids obtained from n independent small samples. Unlike ordinary uniform sampling, the approach asymptotically preserves the performance of the original algorithm. In our experiments with a wide range of synthetic and real-world datasets, k-means-lite matches the performance of k-means when C is constructed using 30 samples of size 40+2k. Although the 30-sample choice proves to be a generally reliable rule, when the proposed approach is used to scale k-means++ (we call this scaled version k-means-lite++), k-means++’ performance is matched in several cases, using only five samples. These two new algorithms are presented to demonstrate the proposed approach, but the approach can be applied to create a constant-time version of any other k-means clustering algorithm, since it does not modify the internal workings of the base algorithm. DA - 2019-12 DB - ResearchSpace DP - CSIR KW - k-means KW - Clustering KW - Scalable KW - Large datasets LK - https://researchspace.csir.co.za PY - 2019 SM - 0941-0643 SM - 1433-3058 T1 - Rethinking k-means clustering in the age of massive datasets: A constant-time approach TI - Rethinking k-means clustering in the age of massive datasets: A constant-time approach UR - http://hdl.handle.net/10204/11411 ER -