Metric binary trees, and nested cluster hierarchy building
Vestnik Sankt-Peterburgskogo universiteta. Prikladnaâ matematika, informatika, processy upravleniâ, Tome 20 (2024) no. 4, pp. 487-499 Cet article a éte moissonné depuis la source Math-Net.Ru

Voir la notice de l'article

Machine learning methods use data trees to organize and store information. Each of these structures has certain advantages and allows improving the quality of a particular algorithm. If all tree nodes have no more than two descendants, then it is called binary; its main advantage is the high efficiency of implementing search and sorting algorithms. In this regard, it is important to note that dendrograms of hierarchical agglomerative clustering methods are also binary trees reflecting the taxonomy of elements of a data set. Any cluster that is not a singleton can be divided into subclusters, and this allows us to form a hierarchical structure in a metric space (metric tree) with additional properties. For example, automatically set the height of the tree, considering by definition that the number of levels on which its nodes are located coincides with the number of options for dividing the sample set into clusters, subclusters, subclusters of subclusters, etc. This problem can be solved using approximation-estimation tests, the change in sensitivity of which, using the trend coefficient, allows us to obtain various clustering options. When conducting computational experiments, a synthetic set of points on the Euclidean plane was used and were studied the results of centroid-based clustering. Markov moments of stopping the clustering process were determined using approximation-estimation test. Verification of the results obtained in numerical modeling was carried out by changing the step size of the trend coefficient.
Keywords: metric tree, agglomerative clustering, least squares method.
Mots-clés : Markov moment
@article{VSPUI_2024_20_4_a4,
     author = {A. V. Orekhov and I. V. Vasiliev},
     title = {Metric binary trees, and nested cluster hierarchy building},
     journal = {Vestnik Sankt-Peterburgskogo universiteta. Prikladna\^a matematika, informatika, processy upravleni\^a},
     pages = {487--499},
     year = {2024},
     volume = {20},
     number = {4},
     language = {ru},
     url = {http://geodesic.mathdoc.fr/item/VSPUI_2024_20_4_a4/}
}
TY  - JOUR
AU  - A. V. Orekhov
AU  - I. V. Vasiliev
TI  - Metric binary trees, and nested cluster hierarchy building
JO  - Vestnik Sankt-Peterburgskogo universiteta. Prikladnaâ matematika, informatika, processy upravleniâ
PY  - 2024
SP  - 487
EP  - 499
VL  - 20
IS  - 4
UR  - http://geodesic.mathdoc.fr/item/VSPUI_2024_20_4_a4/
LA  - ru
ID  - VSPUI_2024_20_4_a4
ER  - 
%0 Journal Article
%A A. V. Orekhov
%A I. V. Vasiliev
%T Metric binary trees, and nested cluster hierarchy building
%J Vestnik Sankt-Peterburgskogo universiteta. Prikladnaâ matematika, informatika, processy upravleniâ
%D 2024
%P 487-499
%V 20
%N 4
%U http://geodesic.mathdoc.fr/item/VSPUI_2024_20_4_a4/
%G ru
%F VSPUI_2024_20_4_a4
A. V. Orekhov; I. V. Vasiliev. Metric binary trees, and nested cluster hierarchy building. Vestnik Sankt-Peterburgskogo universiteta. Prikladnaâ matematika, informatika, processy upravleniâ, Tome 20 (2024) no. 4, pp. 487-499. http://geodesic.mathdoc.fr/item/VSPUI_2024_20_4_a4/

[1] Mwangi D., 5 common data structures and algorithms used in machine learning (accessed: September 19, 2024) https://dzone.com/articles/5-common-data-structures-and-algorithms-used-in-ma

[2] Cormen T. H., Leiserson C. E., Rivest R. L., Stein C., Introduction to algorithms, 4$^{\rm th}$ ed., MIT Press and McGraw-Hill, Cambridge, Massachusetts, US, 2022, 1312 pp.

[3] Quinlan J. R., “Induction of decision trees”, Machine Learning, 1 (1986), 81–106 | DOI

[4] Uhlmann J. K., “Satisfying general proximity/similarity queries with metric trees”, Information Processing Letters, 40:4 (1991), 175–179 | DOI

[5] Samet H., Foundations of multidimensional and metric data structures, The Morgan Kaufmann series in data management systems, Morgan Kaufmann Publ. Inc., San Francisco, US, 2006, 1024 pp.

[6] Bozkaya T., Ozsoyoglu Z. M., “Indexing large metric spaces for similarity search queries”, ACM Trans. Database Systems, 24:3 (1999), 361–404 | DOI

[7] Brin S., “Near neighbor search in large metric spaces”, 21$^{\rm th}$ International Conference on Very Large Data Bases (VLDB 1995) (September 11–15, 1995, Zurich, Switzerland), 1995, 574–584

[8] Jambue M., Hierarchical cluster analysis and correspondences, Finance and Statistics Publ., M., 1988, 344 pp. (in Russian)

[9] Calirnski T., Harabasz J., “A dendrite method for cluster analysis”, Communications in Statistics, 1974, no. 3, 1–27

[10] Orekhov A. V., “Markov moment for the agglomerative method of clustering in Euclidean space”, Vestnik of Saint Petersburg University. Applied Mathematics. Computer Science. Control Processes, 15:1 (2019), 76–92 (in Russian) | DOI

[11] Orekhov A. V., “Agglomerative method for texts clustering”, Lecture Notes in Computer Science, 11551, 2019, 19–32 | DOI

[12] Everitt B. S., Cluster analysis, John Wiley Sons Ltd. Press, Chichester, West Sussex, 2011, 330 pp.

[13] Duda R. O., Hart P. E., Stor D. G., Pattern classification, Wiley Press, New York, Chichester, 2001, 654 pp.

[14] Lance G. N., Williams W. T., “A general theory of classificatory sorting strategies. 1. Hierarchical systems”, The Computer Journal, 9 (1967), 373–380

[15] Milligan G. W., “Ultrametric hierarchical clustering algorithms”, Psychometrika, 44 (1979), 343–346

[16] Thorndike R. L., Who belongs in the family?, Psychometrika, 18 (1953), 267–276 | DOI

[17] Oldenderfer M. S., Blashfield R. K., “Cluster analysis”, Factor, discriminant and cluster analysis, Finance and Statistics Publ., M., 1989, 139–209 (in Russian)

[18] Orekhov A. V., “Quasi-deterministic processes with monotonic trajectories and unsupervised machine learning”, Mathematics, 9 (2021), 2301 | DOI

[19] Levin B. R., Theoretical foundations of statistical radio engineering, Radio and Communication Publ., M., 1989, 656 pp. (in Russian)

[20] Bodrunova S. S., Orekhov A. V., Blekanov I. S., Lyudkevich N. S., Tarasov N. A., “Topic detection based on sentence embeddings and agglomerative clustering with Markov moment”, Future Internet, 12 (2020), 144 | DOI

[21] Mazalov V. V., Mathematical game theory and applications, Lan' Publ., St. Petersburg, 2017, 448 pp. (in Russian)

[22] Bulinsky A. V., Shiryaev A. N., Theory of random processes, Fizmatlit Publ., M., 2003, 400 pp. (in Russian)

[23] Shiryaev A. N., Statistical sequential analysis, 2nd ed., Nauka Publ., M., 1976, 272 pp. (in Russian)

[24] Granichin O. N., Shalymov D. S., Avros R., Volkovich Z., “Randomized algorithm for finding the number of clusters”, Automation and Telemechanics, 2011, no. 4, 86–98 (in Russian)

[25] Hierarchical clustering (in Russian) (accessed: September 19, 2024) https://neerc.ifmo.ru/wiki/index.php?title=Иерархическая_кластеризация

[26] Rousseeuw P. J., “Silhouettes: A graphical aid to the interpretation and validation of cluster analysis”, Journal of Comput. Appl. Math., 20 (1987), 53–65 | DOI

[27] Kaufman L., Rousseeuw P. J., Finding groups in data: An introduction to cluster analysis, John Wiley Sons Inc., New York, 1990, 342 pp. | DOI

[28] Fisher R. A., “The use of multiple measurements in taxonomic problems”, Annals of Eugenics, 7 (1936), 179–188 | DOI

[29] Snell-Hornby M., Translation studies: An integrated approach, John Benjamins Publ., Amsterdam, 1988, 172 pp.