Markov moment for the agglomerative method of clustering in Euclidean space
Vestnik Sankt-Peterburgskogo universiteta. Prikladnaâ matematika, informatika, processy upravleniâ, Tome 15 (2019) no. 1, pp. 76-92 Cet article a éte moissonné depuis la source Math-Net.Ru

Voir la notice de l'article

When processing large arrays of empirical data or large-scale data, cluster analysis remains one of the primary methods of preliminary typology, which makes it necessary to obtain formal rules for calculating the number of clusters. The most common method for determining the preferred number of clusters is the visual analysis of dendrograms, but this approach is purely heuristic. The number of clusters and the end moment of the clustering algorithm depend on each other. Cluster analysis of data from $n$-dimensional Euclidean space using the “single linkage” method can consider as a discrete random process. Sequences of “minimum distances” define the trajectories of this process. The “approximation-estimating test” allows us to establish the Markov moment when the growth rate of such a sequence changes from linear to parabolic, which, in turn, may be a sign of the completion of the agglomerative clustering process. The calculation of the number of clusters is the critical problem in many cases of the automatic typology of empirical data. For example, in medicine with cytometric analysis of blood, automated analysis of texts and in other instances when the number of clusters not known in advance.
Keywords: cluster analysis, least squares method
Mots-clés : Markov moment.
@article{VSPUI_2019_15_1_a5,
     author = {A. V. Orekhov},
     title = {Markov moment for the agglomerative method of clustering in {Euclidean} space},
     journal = {Vestnik Sankt-Peterburgskogo universiteta. Prikladna\^a matematika, informatika, processy upravleni\^a},
     pages = {76--92},
     year = {2019},
     volume = {15},
     number = {1},
     language = {ru},
     url = {http://geodesic.mathdoc.fr/item/VSPUI_2019_15_1_a5/}
}
TY  - JOUR
AU  - A. V. Orekhov
TI  - Markov moment for the agglomerative method of clustering in Euclidean space
JO  - Vestnik Sankt-Peterburgskogo universiteta. Prikladnaâ matematika, informatika, processy upravleniâ
PY  - 2019
SP  - 76
EP  - 92
VL  - 15
IS  - 1
UR  - http://geodesic.mathdoc.fr/item/VSPUI_2019_15_1_a5/
LA  - ru
ID  - VSPUI_2019_15_1_a5
ER  - 
%0 Journal Article
%A A. V. Orekhov
%T Markov moment for the agglomerative method of clustering in Euclidean space
%J Vestnik Sankt-Peterburgskogo universiteta. Prikladnaâ matematika, informatika, processy upravleniâ
%D 2019
%P 76-92
%V 15
%N 1
%U http://geodesic.mathdoc.fr/item/VSPUI_2019_15_1_a5/
%G ru
%F VSPUI_2019_15_1_a5
A. V. Orekhov. Markov moment for the agglomerative method of clustering in Euclidean space. Vestnik Sankt-Peterburgskogo universiteta. Prikladnaâ matematika, informatika, processy upravleniâ, Tome 15 (2019) no. 1, pp. 76-92. http://geodesic.mathdoc.fr/item/VSPUI_2019_15_1_a5/

[1] Everitt B. S., Cluster analysis, John Wiley Sons Ltd, Chichester, West Sussex, UK, 2011, 330 pp. | MR | Zbl

[2] Duda R. O., Hart P. E., Stork D. G., Pattern classification, 2nd ed., Wiley, New York–Chichester, 2001, 654 pp. | MR | Zbl

[3] Calirnski T., Harabasz J., “A dendrite method for cluster analysis”, Communications in Statistics, 1974, no. 3, 1–27 | MR

[4] Baxter M. J., Exploratory multivariate analysis in archaeology, Edinburgh University Press, Edinburgh, 1994, 307 pp.

[5] Sugar C. A., James G. M., “Finding the number of clusters in a dataset”, Journal of the American Statistical Association, 98:463 (2003), 750–763 | DOI | MR | Zbl

[6] Granichin O. N., Shalymov D. S., Avros R., Volkovich Z., “A randomized algorithm for estimating the number of clusters”, Automation and Remote Control, 2011, no. 4, 86–98 (In Russian) | MR | Zbl

[7] Shalymov D. S., “Randomized method for determining the number of clusters on a data set”, Scientific and Technical Gazette of Saint Petersburg State University of Information Technologies, Mechanics and Optics, 2009, no. 5(63), 111–116 (In Russian)

[8] Zhang G., Zhang C., Zhang H., “Improved $K$-means algorithm based on density Canopy”, Knowledge-Based Systems, 145 (2018), 1–14 | DOI

[9] Jiali W., Yue Z., Xv L., “Automatic cluster number selection by finding density peaks”, 2016 2nd IEEE Intern. Conference on Computer and Communications (ICCC), IEEE Proceedings (Chengdu, China, 2016), 13–18 | DOI

[10] Cordeiro de Amorim R., Hennig C., “Recovering the number of clusters in data sets with noise features using feature rescaling factors”, Information Sciences, 324 (2015), 126–145 | DOI | MR

[11] Lozkins A., Bure V. M., “A probabilistic approach to determining the locally optimal number of clusters”, Vestnik of Saint Petersburg University. Applied Mathematics. Computer Science. Control Processes, 13:1 (2016), 28–37 (In Russian) | MR

[12] Shalymov D. S., “Algorithms for stable clustering based on index functions and stability functions”, Stochastic optimization in computer science, 4:1-1 (2008), 236–248 (In Russian)

[13] Steinhaus H., “Sur la division des corps materiels en parties”, Bull. Acad. Polon. Sci. Cl. III, IV (1956), 801–804 | MR

[14] Lloyd S., “Least squares quantization in PCM”, IEEE Transactions on Information Theory, 28:2 (1982), 129–137 | DOI | MR | Zbl

[15] Hartigan J. A., Clustering algorithms, John Wiley Sons Inc., New York–London–Sydney–Toronto, 1975, 351 pp. | MR | Zbl

[16] Aldenderfer M. S., Blashfield R. K., Cluster analysis, Sage Publications Inc., Newburg Park, 1984, 88 pp.

[17] Wald A., Sequential analysis, John Wiley Sons Inc., New York, 1947, 212 pp. | MR

[18] Sirjaev A. N., Statistical sequential analysis: Optimal stopping rules, American Mathematical Society, New York, 1973, 174 pp. | MR | Zbl

[19] Shiryaev A. N., Optimal stopping rules, Springer, Berlin–Heidelberg, 2009, 220 pp. | MR

[20] Orekhov A. V., “Criterion for estimation of stress-deformed state of SD-materials”, AIP Conference Proceedings, 1959, 2018, 070028 pp. | DOI

[21] Orekhov A. V., “Approximation-evaluation tests for a stress-strain state of deformable solids”, Vestnik of Saint Petersburg University. Applied Mathematics. Computer Science. Control Processes, 14:3 (2018), 230–242 (In Russian) | DOI | MR

[22] McCaffrey J., “Test run — $k$-means++ data clustering”, MSDN Magazine, 30:8 (2015), 62–68

[23] Zurochka A. V., Khaydukov S. V., Kudryavtsev I. V., Chereshnev V. A., Flow cytometry in medicine and biology, 2nd ed., Ural Branch of the Russian Academy of Sciences Publ., Yekaterinburg, 2014, 574 pp. (In Russian)

[24] Lappin S., Fox C., The handbook of contemporary semantic theory, 2nd ed., Wiley-Blackwell, 2015, 776 pp.