Exact algorithms of searching for the largest size cluster in two integer 2-clustering problems
Sibirskij žurnal vyčislitelʹnoj matematiki, Tome 22 (2019) no. 2, pp. 121-136.

Voir la notice de l'article provenant de la source Math-Net.Ru

We consider two related discrete optimization problems of searching for a subset in a finite set of points in the Euclidean space. Both problems are induced by the versions of the fundamental problem in data analysis, namely, by selecting a subset of similar elements in a set of objects. In each problem, an input set and a positive real number are given, and it is required to find a cluster (i.e., a subset) of the largest size under constraints on the value of a quadratic clusterization function. The points in the input set which are outside the sought for subset are treated as the second (complementary) cluster. In the first problem, the function under the constraint is the sum over both clusters of the intracluster sums of the squared distances between the elements of the clusters and their centers. The center of the first (i.e., the sought) cluster is unknown and determined as the centroid, while the center of the second one is fixed at a given point in the Euclidean space (without loss of generality in the origin). In the second problem, the function under the constraint is the sum over both clusters of the weighted intracluster sums of the squared distances between the elements of the clusters and their centers. As in the first problem, the center of the first cluster is unknown and determined as the centroid, while the center of the second one is fixed in the origin. In this paper, we show that both problems are strongly NP-hard. Also, we present the exact algorithms for the cases of these problems in which the input points have integer components. If the space dimension is bounded by some constant, the algorithms are pseudopolynomial.
@article{SJVM_2019_22_2_a1,
     author = {A. V. Kel'manov and A. V. Panasenko and V. I. Khandeev},
     title = {Exact algorithms of searching for the largest size cluster in two integer 2-clustering problems},
     journal = {Sibirskij \v{z}urnal vy\v{c}islitelʹnoj matematiki},
     pages = {121--136},
     publisher = {mathdoc},
     volume = {22},
     number = {2},
     year = {2019},
     language = {ru},
     url = {http://geodesic.mathdoc.fr/item/SJVM_2019_22_2_a1/}
}
TY  - JOUR
AU  - A. V. Kel'manov
AU  - A. V. Panasenko
AU  - V. I. Khandeev
TI  - Exact algorithms of searching for the largest size cluster in two integer 2-clustering problems
JO  - Sibirskij žurnal vyčislitelʹnoj matematiki
PY  - 2019
SP  - 121
EP  - 136
VL  - 22
IS  - 2
PB  - mathdoc
UR  - http://geodesic.mathdoc.fr/item/SJVM_2019_22_2_a1/
LA  - ru
ID  - SJVM_2019_22_2_a1
ER  - 
%0 Journal Article
%A A. V. Kel'manov
%A A. V. Panasenko
%A V. I. Khandeev
%T Exact algorithms of searching for the largest size cluster in two integer 2-clustering problems
%J Sibirskij žurnal vyčislitelʹnoj matematiki
%D 2019
%P 121-136
%V 22
%N 2
%I mathdoc
%U http://geodesic.mathdoc.fr/item/SJVM_2019_22_2_a1/
%G ru
%F SJVM_2019_22_2_a1
A. V. Kel'manov; A. V. Panasenko; V. I. Khandeev. Exact algorithms of searching for the largest size cluster in two integer 2-clustering problems. Sibirskij žurnal vyčislitelʹnoj matematiki, Tome 22 (2019) no. 2, pp. 121-136. http://geodesic.mathdoc.fr/item/SJVM_2019_22_2_a1/

[1] MacQueen J. B., “Some methods for classification and analysis of multivariate observations”, Proc. of the 5th Berkeley Symposium on Mathematical Statistics and Probability, v. 1, Univ. of California Press, Berkeley, 1967, 281–297 | MR | Zbl

[2] Rao M., “Cluster analysis and mathematical programming”, J. Amer. Stat. Assoc., 66 (1971), 622–626 | DOI | Zbl

[3] Hansen P., Jaumard B., Mladenovich N., “Minimum sum of squares clustering in a low dimensional space”, J. Classification, 15 (1998), 37–55 | DOI | MR | Zbl

[4] Hansen P., Jaumard B., “Cluster analysis and mathematical programming”, Mathematical Programming, 79 (1997), 191–215 | MR | Zbl

[5] Fisher R. A., Statistical Methods and Scientific Inference, Hafner, New York, 1956 | MR

[6] Jain A. K., “Data clustering: 50 years beyond k-means”, Pattern Recognition Letters, 31:8 (2010), 651–666 | DOI

[7] Aloise D., Deshpande A., Hansen P., Popat P., “NP-hardness of Euclidean sum-of-squares clustering”, Machine Learning, 75:2 (2009), 245–248 | DOI | Zbl

[8] Drineas P., Frieze A., Kannan R., Vempala S., Vinay V., “Clustering large graphs via the singular value decomposition”, Machine Learning, 56 (2004), 9–33 | DOI | Zbl

[9] Dolgushev A. V., Kel'manov A. V., “On the algorithmic complexity of a problem in cluster analysis”, J. of Applied and Industrial Mathematics, 5:2 (2011), 191–194 | DOI | MR

[10] Mahajan M., Nimbhorkar P., Varadarajan K., “The planar k-means problem is NP-hard”, Theoretical Computer Science, 442 (2012), 3–21 | DOI | MR

[11] Brucker P., “On the complexity of clustering problems”, Lecture Notes in Economics and Mathematical Systems, 157, 1978, 45–54 | DOI | MR | Zbl

[12] Bern M., Eppstein D., “Approximation algorithms for geometric problems”, Approximation Algorithms for NP-Hard Problems, PWS Publ., Boston, 1997, 296–345

[13] Indyk P., “A sublinear time approximation scheme for clustering in metric space”, Proc. of the 40th Ann. IEEE Symp. on Foundations of Computer Science (FOCS), 1999, 154–159 | MR

[14] de la Vega F., Kenyon C., “A randomized approximation scheme for metric max-cut”, J. of Computer and System Sciences, 63 (2001), 531–541 | DOI | MR | Zbl

[15] de la Vega F., Karpinski M., Kenyon C., Rabani Y., Electronic Colloquium on Computational Complexity (ECCC), Report No 25, 2002

[16] Hasegawa S., Imai H., Inaba M., Katoh N., Nakano J., “Efficient algorithms for variancebased k-clustering”, Proc. of the 1st Pacific Conference on Computer Graphics and Applications, Pacific Graphics'93 (Seoul, Korea), v. 1, World Scientific, River Edge, NJ, 1993, 75–89

[17] Inaba M., Katoh N., Imai H., “Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering”, extended abstract, SCG'94 Proc. of the tenth annual symposium on Computational geometry (Stony Brook, NY, USA, June 6–8, 1994), ACM, New York, 1994, 332–339 | DOI

[18] Sahni S., Gonzalez T., “P-complete approximation problems”, J. of the ACM, 23 (1976), 555–566 | DOI | MR

[19] Ageev A. A., Kel'manov A. V., Pyatkin A. V., “NP-hardness of the Euclidean max-cut problem”, Doklady Mathematics, 89:3 (2014), 343–345 | DOI | MR | Zbl

[20] Ageev A. A., Kel'manov A. V., Pyatkin A. V., “Complexity of the weighted max-cut in Euclidean space”, J. of Applied and Industrial Mathematics, 8:4 (2014), 453–457 | DOI | MR | Zbl

[21] Kel'manov A. V., Pyatkin A. V., “On the complexity of a search for a subset of “similar” vectors”, Doklady Mathematics, 78:1 (2008), 574–575 | DOI | MR | Zbl

[22] Kel'manov A. V., Pyatkin A. V., “On a version of the problem of choosing a vector subset”, J. of Applied and Industrial Mathematics, 3:4 (2009), 447–455 | DOI | MR

[23] Kel'manov A. V., Pyatkin A. V., “NP-hardness of some quadratic Euclidean 2-clustering problems”, Doklady Mathematics, 92:2 (2015), 634–637 | DOI | MR | Zbl

[24] Kel'manov A. V., Pyatkin A. V., “On the complexity of some quadratic Euclidean 2-clustering problems”, Computational Mathematics and Mathematical Physics, 56:3 (2016), 491–497 | DOI | MR | Zbl

[25] Bishop C. M., Pattern Recognition and Machine Learning, Springer Science+Business Media, LLC, New York, 2006 | MR | Zbl

[26] James G., Witten D., Hastie T., Tibshirani R., An Introduction to Statistical Learning, Springer Science+Business Media, LLC, New York, 2013 | MR | Zbl

[27] Hastie T., Tibshirani R., Friedman J., The Elements of Statistical Learning, 2nd edition, Springer-Verlag, 2009 | MR | Zbl

[28] Aggarwal C. C., Data Mining: The Textbook, Springer International Publishing, 2015 | MR | Zbl

[29] Goodfellow I., Bengio Y., Courville A., Deep Learning, Adaptive Computation and Machine Learning series, The MIT Press, 2017 | MR

[30] Shirkhorshidi A. S., Aghabozorgi S., Wah T. Y., Herawan T., “Big data clustering: a review”, LNCS, 8583, 2014, 707–720

[31] Pach J., Agarwal P. K., Combinatorial Geometry, Wiley, New York, 1995 | MR | Zbl

[32] Kel'manov A. V., Khandeev V. I., “A 2-approximation polynomial algorithm for a clustering problem”, J. of Applied and Industrial Mathematics, 7:4 (2013), 515–521 | DOI | MR | Zbl

[33] Gimadi E. Kh., Kel'manov A. V., Kel'manova M. A., Khamidullin S. A., “Aposteriornoe obnaruzhenie v chislovoy posledovatel'nosti kvaziperiodicheskogo fragmenta pri zadannom chisle povtorov”, Sib. zhurn. industr. matem., 9:1 (2006), 55–74

[34] Gimadi E.Kh., Kel'manov A. V., Kel'manova M. A., Khamidullin S. A., “A posteriori detecting a quasiperiodic fragment in a numerical sequence”, Pattern Recognition and Image Analysis, 18:1 (2008), 30–42 | DOI

[35] Baburin A. E., Gimadi E.Kh., Glebov N. I., Pyatkin A. V., “The problem of finding a subset of vectors with the maximum total weight”, J. of Applied and Industrial Mathematics, 2:1 (2008), 32–38 | DOI | MR

[36] Gimadi E.Kh., Pyatkin A. V., Rykov I. A., “On polynomial solvability of some problems of a vector subset choice in a Euclidean space of fixed dimension”, J. of Applied and Industrial Mathematics, 4:1 (2010), 48–53 | DOI | MR

[37] Shenmaier V. V., “Solving some vector subset problems by Voronoi diagrams”, J. of Applied and Industrial Mathematics, 10:4 (2016), 560–566 | DOI | MR | Zbl

[38] Kel'manov A. V., Khandeev V. I., “An exact pseudopolynomial algorithm for a problem of the two-cluster partitioning of a set of vectors”, J. of Applied and Industrial Mathematics, 9:4 (2015), 497–502 | DOI | MR | Zbl

[39] Dolgushev A. V., Kel'manov A. V., “An approximation algorithm for solving a problem of cluster analysis”, J. of Applied and Industrial Mathematics, 5:4 (2011), 551–558 | DOI | MR

[40] Dolgushev A. V., Kel'manov A. V., Shenmaier V. V., “Polynomial-time approximation scheme for a problem of partitioning a finite set into two clusters”, Proc. of the Steklov Institute of Mathematics, 295, supplement 1 (2016), 47–56 | DOI | MR

[41] Kel'manov A. V., Khandeev V. I., “Fully polynomial-time approximation scheme for a special case of a quadratic Euclidean 2-clustering problem”, J. of Applied and Industrial Mathematics, 56:2 (2016), 334–341 | MR | Zbl

[42] Kel'manov A. V., Motkova A. V., Shenmaier V. V., “An approximation scheme for a weighted two-cluster partition problem”, LNCS, 10716, 2018, 323–333

[43] Kel'manov A. V., Khandeev V. I., “A randomized algorithm for two-cluster partition of a set of vectors”, Computational Mathematics and Mathematical Physics, 55:2 (2015), 330–339 | DOI | MR | Zbl

[44] Kel'manov A. V., Motkova A. V., “Polynomial-time approximation algorithm for the problem of cardinality-weighted variance-based 2-clustering with a given center”, Computational Mathematics and Mathematical Physics, 58:1 (2018), 130–136 | DOI | MR | Zbl

[45] Kel'manov A. V., Motkova A. V., “Exact pseudopolynomial algorithms for a balanced 2-clustering problem”, J. of Applied and Industrial Mathematics, 10:3 (2016), 349–355 | DOI | MR | Zbl

[46] Kel'manov A. V., Motkova A. V., “A fully polynomial-time approximation scheme for a special Case of a balanced 2-clustering problem”, LNCS, 9869, 2016, 182–192 | MR | Zbl

[47] Kel'manov A. V., Khandeev V. I., Panasenko A. V., “Randomized algorithms for some clustering problems”, Communications in Computer and Information Science, CCIS-871 (2018), 109–119 | DOI