New approach to searching for string median and visualization of~string clusters
Čebyševskij sbornik, Tome 20 (2019) no. 2, pp. 93-107.

Voir la notice de l'article provenant de la source Math-Net.Ru

We consider the following problem of searching the median of a string set: $$ c=\mathrm{argmin}_{a\in U}\sum_{b\in D}d(a,b), $$ where $D\subset G^{*}$ is the finite set of strings over the alphabet $G$, $U\subseteq G^{*}$, $d$ is the Levenshtein edit distance. This problem has important applications, e.g., in bioinformatics in the analysis of protein sequences. However, it is known that in the general case for $U=G^{*}$ the median problem is NP-hard. Therefore, for an approximate solution, heuristics algorithms were proposed, in particular, the greedy algorithm. We give a new flexible approach based on smooth Levenshtein distance approximation $\tilde{d}$. It uses stochastic character encoding of sequences and the following formula for the edit distance: $$ d(X_{1},X_{2})=\min_{(X_{1}',X_{2}'')_{e}\subseteq X_{1}\times X_{2}} \Bigl\{\frac{1}{2}\,\|X_{1}'-X_{2}''\|_{1}+|X_{1}|-|X_{1}'|+ |X_{2}|-|X_{2}''|\Bigr\}, $$ where the minimum is taken over all subsequences $(X_{1}',X_{2}'')_{e}$ of equal length. On the one hand, stochastic coding extends the class, over which the extremum is searched. However, our main result shows that the median is the same. On the other hand, now we can use smooth optimization methods, replacing the minimum in the definition above with a smooth approximation. We give implementation details based on the gradient descent, including the recursion formulas for calculating $\tilde{d}$. Effective calculation of the approximate median allows, e.g., to use the $k$-means method for string clustering. We propose an approach to visualize such clusters based on tSNE stochastic nesting method.
Keywords: Levenshtein edit distance, string median, smooth minimum, gradient descent, $k$-means, string cluster visualization.
@article{CHEB_2019_20_2_a6,
     author = {D. V. Gorbachev and E. P. Ofitserov},
     title = {New approach to searching for string median and visualization of~string clusters},
     journal = {\v{C}eby\v{s}evskij sbornik},
     pages = {93--107},
     publisher = {mathdoc},
     volume = {20},
     number = {2},
     year = {2019},
     language = {ru},
     url = {http://geodesic.mathdoc.fr/item/CHEB_2019_20_2_a6/}
}
TY  - JOUR
AU  - D. V. Gorbachev
AU  - E. P. Ofitserov
TI  - New approach to searching for string median and visualization of~string clusters
JO  - Čebyševskij sbornik
PY  - 2019
SP  - 93
EP  - 107
VL  - 20
IS  - 2
PB  - mathdoc
UR  - http://geodesic.mathdoc.fr/item/CHEB_2019_20_2_a6/
LA  - ru
ID  - CHEB_2019_20_2_a6
ER  - 
%0 Journal Article
%A D. V. Gorbachev
%A E. P. Ofitserov
%T New approach to searching for string median and visualization of~string clusters
%J Čebyševskij sbornik
%D 2019
%P 93-107
%V 20
%N 2
%I mathdoc
%U http://geodesic.mathdoc.fr/item/CHEB_2019_20_2_a6/
%G ru
%F CHEB_2019_20_2_a6
D. V. Gorbachev; E. P. Ofitserov. New approach to searching for string median and visualization of~string clusters. Čebyševskij sbornik, Tome 20 (2019) no. 2, pp. 93-107. http://geodesic.mathdoc.fr/item/CHEB_2019_20_2_a6/

[1] Gusfield D., Algorithms on strings, trees, and sequences: Computer science and computational biology, Cambridge University Press, New York, 1997

[2] E. P. Ofitserov, “Statistical model of T-cell receptor pereferic selection”, Izvestiya of Tula State University. Technical sciences, 2017, no. 2, 138–143

[3] E. P. Ofitserov, “Deep model of T-cell receptor selection”, Izvestiya of Tula State University. Technical sciences, 2017, no. 12-2, 350–355

[4] E. P. Ofitserov, “Motif based sequence classification”, Chebyshevskii Sbornik, 19:1 (2018), 187–199 (In Russ.) | DOI | MR | Zbl

[5] E. P. Ofitserov, “Software package for solving machine learning problems on string data using soft edit distance”, Izvestiya of Tula State University. Technical sciences, 2019, no. 5, 370–376

[6] J. T. Tou, R. C. Gonzalez, Pattern recognition principles, Addison-Wesley Publishing Company, 1974 | MR | Zbl

[7] S. Boyd, L. Vandenberghe, Convex Optimization, Cambridge University Press, 2004 | MR | Zbl

[8] F. Casacuberta, M. de Antonio, “A greedy algorithm for computing approximate median strings”, VII Simposium Nacional de Reconocimiento de Formas y Análisis de Imágenes, 1997, 193–198

[9] C. De la Higuera, F. Casacuberta, “Topology of strings: Median string is NP-complete”, Theoretical Computer Science, 230:1–2 (2000), 39–48 | DOI | MR | Zbl

[10] M. Hayashida, H. Koyano, “Finding median and center strings for a probability distribution on a set of strings under Levenshtein distance based on integer linear programming”, Biomedical Engineering Systems and Technologies, BIOSTEC 2016, Communications in Computer and Information Science, 690, eds. Fred A., Gamboa H., Springer, Cham, 2017, 108–121

[11] F. Kruzslicz, “Improved greedy algorithm for computing approximate median strings”, Acta Cybernetica, 14:2 (1999), 331–339 | MR | Zbl

[12] C. D. Martínez-Hinarejos, A. Juan, F. Casacuberta, “Use of median string for classification”, Proc. 15th Int. Conf. on Pattern Recognition, ICPR-2000 (Barcelona, Spain, 2000), v. 2, 2000, 903–906 | DOI

[13] L. Van der Maaten, G. Hinton, “Visualizing data using t-SNE”, J. Mach. Learn. Res., 9 (2008), 2579–2605 | Zbl

[14] E. Ofitserov, V. Tsvetkov, V. Nazarov, Soft edit distance for differentiable comparison of symbolic sequences, 2019, arXiv: 1904.12562 | Zbl

[15] C. Olivares-Rodríguez, J. Oncina, “A stochastic approach to median string computation”, Structural, Syntactic, and Statistical Pattern Recognition, SSPR /SPR 2008, Lecture Notes in Computer Science, 5342, eds. da Vitoria Lobo N. et al., Springer, Berlin–Heidelberg, 2008, 431–440 | MR

[16] S. Ruder, An overview of gradient descent optimization algorithms, 2016, arXiv: 1609.04747

[17] J. W. Sammon, “A nonlinear mapping for data structure analysis”, IEEE Transactions on Computers, 18:5 (1969), 401–409

[18] W. S. Torgerson, “Multidimensional scaling I: Theory and method”, Psychometrika, 17 (1952), 401–419 | MR | Zbl

[19] L. Wang, T. Jiang, “On the complexity of multiple sequence alignment”, J. Computat. Biol., 1:4 (1994), 337–348