Wide gaps and Kleinberg’s clustering axioms for k-means
International Journal of Applied Mathematics and Computer Science, Volume 34 (2024), no. 1, pp. 135-147.


The widely applied k-means algorithm produces clusterings that violate our expectations with respect to high/low similarity/density within/between clusters, and it conflicts with Kleinberg’s axiomatic system for distance-based clustering algorithms, which formalizes those expectations. In particular, k-means violates the consistency axiom. We hypothesise that this clash stems from the unstated expectation that the data themselves should be clusterable before the algorithm clustering them can be expected to fit a clustering axiomatic system. To demonstrate this, we introduce two new clusterability properties, variational k-separability and residual k-separability, and show that under them Kleinberg’s consistency axiom holds for k-means operating in Euclidean or non-Euclidean space. Furthermore, we propose extensions of the k-means algorithm that approximately satisfy Kleinberg’s richness axiom, which k-means itself does not. In this way, we reconcile k-means with Kleinberg’s axiomatic framework in Euclidean and non-Euclidean settings. Besides contributing to the theory of axiomatic frameworks of clustering and to clusterability theory, a practical benefit is the possibility of constructing datasets for testing algorithms that optimize the k-means cost function, including a method for constructing clusterable data whose global optimum is known in advance.
Keywords: clustering theory, clustering axioms, clusterability
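The abstract mentions constructing clusterable test data whose k-means global optimum is known in advance. The sketch below illustrates the general "wide gap" idea only: points planted in small balls around centers separated by a distance much larger than the within-cluster spread, followed by plain Lloyd iterations. The function names and the specific constants are my own for illustration; the exact separability conditions (variational and residual k-separability) are defined in the paper itself.

```python
import numpy as np

def make_wide_gap_data(k=3, n_per=50, dim=2, radius=1.0, gap=20.0, seed=0):
    """Plant k clusters inside balls of the given radius whose centers are
    separated by `gap`, chosen much larger than the within-cluster spread.
    A loose illustration of wide-gap clusterability, not the paper's
    exact separability construction."""
    rng = np.random.default_rng(seed)
    centers = np.zeros((k, dim))
    centers[:, 0] = np.arange(k) * gap  # centers spaced `gap` apart on one axis
    X, labels = [], []
    for j in range(k):
        # points uniform in a ball of the given radius around center j
        pts = rng.normal(size=(n_per, dim))
        pts /= np.linalg.norm(pts, axis=1, keepdims=True)
        pts *= radius * rng.random((n_per, 1)) ** (1.0 / dim)
        X.append(centers[j] + pts)
        labels += [j] * n_per
    return np.vstack(X), np.array(labels), centers

def lloyd(X, centers, iters=50):
    """Plain Lloyd iterations for the k-means cost function."""
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)
        centers = np.vstack([X[assign == j].mean(0) for j in range(len(centers))])
    return assign, centers
```

When the gap dominates the cluster radius, the planted partition minimizes the k-means cost, so the global optimum is known by construction and any initialization that picks one point per planted ball recovers it.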
@article{IJAMCS_2024_34_1_a9,
     author = {K{\l}opotek, Mieczys{\l}aw A.},
     title = {Wide gaps and {Kleinberg{\textquoteright}s} clustering axioms for k-means},
     journal = {International Journal of Applied Mathematics and Computer Science},
     pages = {135--147},
     publisher = {mathdoc},
     volume = {34},
     number = {1},
     year = {2024},
     language = {en},
     url = {http://geodesic.mathdoc.fr/item/IJAMCS_2024_34_1_a9/}
}
TY  - JOUR
AU  - Kłopotek, Mieczysław A.
TI  - Wide gaps and Kleinberg’s clustering axioms for k-means
JO  - International Journal of Applied Mathematics and Computer Science
PY  - 2024
SP  - 135
EP  - 147
VL  - 34
IS  - 1
PB  - mathdoc
UR  - http://geodesic.mathdoc.fr/item/IJAMCS_2024_34_1_a9/
LA  - en
ID  - IJAMCS_2024_34_1_a9
ER  - 
%0 Journal Article
%A Kłopotek, Mieczysław A.
%T Wide gaps and Kleinberg’s clustering axioms for k-means
%J International Journal of Applied Mathematics and Computer Science
%D 2024
%P 135-147
%V 34
%N 1
%I mathdoc
%U http://geodesic.mathdoc.fr/item/IJAMCS_2024_34_1_a9/
%G en
%F IJAMCS_2024_34_1_a9
Kłopotek, Mieczysław A. Wide gaps and Kleinberg’s clustering axioms for k-means. International Journal of Applied Mathematics and Computer Science, Tome 34 (2024) no. 1, pp. 135-147. http://geodesic.mathdoc.fr/item/IJAMCS_2024_34_1_a9/

[1] Ackerman, M., Adolfsson, A. and Brownstein, N. (2016). An effective and efficient approach for clusterability evaluation, https://arxiv.org/abs/1602.06687.

[2] Ackerman, M., Ben-David, S. and Loker, D. (2010). Towards property-based classification of clustering paradigms, in J.D. Lafferty et al. (Eds), Advances in Neural Information Processing Systems 23, Curran Associates, Inc., San Francisco, pp. 10-18.

[3] Arthur, D. and Vassilvitskii, S. (2007). k-means++: The advantages of careful seeding, in N. Bansal et al. (Eds), 18th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2007, SIAM, Philadelphia, pp. 1027-1035.

[4] Ben-David, S. (2015). Computational feasibility of clustering under clusterability assumptions, https://arxiv.org/abs/1501.00437.

[5] Ben-David, S. and Ackerman, M. (2009). Measures of clustering quality: A working set of axioms for clustering, in D. Koller et al. (Eds), Advances in Neural Information Processing Systems 21, Curran Associates, Inc., San Francisco, pp. 121-128.

[6] Cailliez, F. (1983). The analytical solution of the additive constant problem, Psychometrika 48(2): 305-308.

[7] Cohen-Addad, V., Kanade, V. and Mallmann-Trenn, F. (2018). Clustering redemption - Beyond the impossibility of Kleinberg’s axioms, in S. Bengio et al. (Eds), Advances in Neural Information Processing Systems, Vol. 31, Curran Associates, Inc., San Francisco.

[8] Ding, C. (2009). Dimension reduction techniques for clustering, in L. Liu and M. Özsu (Eds), Encyclopedia of Database Systems, Springer, Boston, p. 846.

[9] Gao, Z. and Zhang, L. (2017). DPHKMS: An efficient hybrid clustering preserving differential privacy in spark, in L. Barolli et al. (Eds), Advances in Internetworking, Data & Web Technologies, Lecture Notes on Data Engineering and Communications Technologies, Vol. 6, Springer, Cham, pp. 367-377.

[10] Girolami, M. (2002). Mercer kernel-based clustering in feature space, IEEE Transactions on Neural Networks 13(3): 780-784.

[11] Hopcroft, J. and Kannan, R. (2012). Computer Science Theory for the Information Age, Chapter 8.13.2., p. 272ff, http://www.cs.cmu.edu/~venkatg/teaching/CStheory-infoage/hopcroft-kannan-feb2012.pdf.

[12] Howland, P. and Park, H. (2008). Cluster preserving dimension reduction methods for document classification, in M. Berry and M. Castellanos (Eds), Survey of Text Mining: Clustering, Classification, and Retrieval, Second Edition, Springer, London, pp. 3-23.

[13] Keller, H., Möllering, H., Schneider, T. and Yalame, H. (2021). Privacy-preserving clustering, in S.-L. Gazdag et al. (Eds), Crypto Day Matters 32, Gesellschaft für Informatik e.V./FG KRYPTO, Bonn.

[14] Kleinberg, J. (2002). An impossibility theorem for clustering, Proceedings of the 15th International Conference on Neural Information Processing Systems, Montreal, Canada, pp. 446-453.

[15] Kłopotek, M.A. (2020). An aposteriorical clusterability criterion for k-means++ and simplicity of clustering, SN Computer Science 1(2): 80.

[16] Kłopotek, M.A. and Kłopotek, R.A. (2023). On the discrepancy between Kleinberg’s clustering axioms and k-means clustering algorithm behavior, Machine Learning 112(7): 2501-2553.

[17] Kłopotek, M. and Kłopotek, R. (2022). Richness fallacy, in G. Manco and Z.W. Raś (Eds), Foundations of Intelligent Systems, Lecture Notes in Computer Science, Vol. 13515, Springer, Cham, pp. 262-271.

[18] Kłopotek, R., Kłopotek, M. and Wierzchoń, S. (2020). A feasible k-means kernel trick under non-Euclidean feature space, International Journal of Applied Mathematics and Computer Science 30(4): 703-715, DOI: 10.34768/amcs-2020-0052.

[19] Larsen, K.G., Nelson, J., Nguyễn, H.L. and Thorup, M. (2019). Heavy hitters via cluster-preserving clustering, Communications of the ACM 62(8): 95-100.

[20] Lingoes, J. (1971). Some boundary conditions for a monotone analysis of symmetric matrices, Psychometrika 36: 195-203.

[21] Lucińska, M. and Wierzchoń, S.T. (2018). Clustering based on eigenvectors of the adjacency matrix, International Journal of Applied Mathematics and Computer Science 28(4): 771-786, DOI: 10.2478/amcs-2018-0059.

[22] Madhulatha, T. (2012). An overview on clustering methods, IOSR Journal of Engineering 2(4): 719-725.

[23] Meilă, M. (2005). Comparing clusterings: An axiomatic view, Proceedings of the 22nd International Conference on Machine Learning, Bonn, Germany, pp. 577-584.

[24] Ostrovsky, R., Rabani, Y., Schulman, L.J. and Swamy, C. (2013). The effectiveness of Lloyd-type methods for the k-means problem, Journal of the ACM 59(6): 28:1-28:22.

[25] Parameswaran, R. and Blough, D.M. (2005). A robust data-obfuscation approach for privacy preservation of clustered data, Proceedings of the Workshop on Privacy and Security Aspects of Data Mining, Houston, USA, pp. 18-25.

[26] Ramírez, D.H. and Auñón, J.M. (2020). Privacy preserving k-means clustering: A secure multi-party computation approach, https://arxiv.org./abs/2009.10453.

[27] Roth, V., Laub, J., Kawanabe, M. and Buhmann, J. (2003). Optimal cluster preserving embedding of nonmetric proximity data, IEEE Transactions on Pattern Analysis and Machine Intelligence 25(12): 1540-1551.

[28] Sabo, K. (2014). Center-based l1-clustering method, International Journal of Applied Mathematics and Computer Science 24(1): 151-163, DOI: 10.2478/amcs-2014-0012.

[29] Strazzeri, F. and Sánchez-García, R.J. (2021). Possibility results for graph clustering: A novel consistency axiom, https://arxiv.org/abs/1806.06142.

[30] Suchy, D. and Siminski, K. (2023). GrDBSCAN: A granular density-based clustering algorithm, International Journal of Applied Mathematics and Computer Science 33(2): 297-312, DOI: 10.34768/amcs-2023-0022.

[31] van Laarhoven, T. and Marchiori, E. (2014). Axioms for graph clustering quality functions, Journal of Machine Learning Research 15: 193-215.

[32] Zhang, J., Zhu, K., Pei, Y., Fletcher, G. and Pechenizkiy, M. (2019). Cluster-preserving sampling from fully-dynamic streaming graphs, Information Sciences 482: 279-300.

[33] Zhao, Y., Tarus, S.K., Yang, L.T., Sun, J., Ge, Y. and Wang, J. (2020). Privacy-preserving clustering for big data in cyber-physical-social systems: Survey and perspectives, Information Sciences 515: 132-155.