New approach to searching for string median and visualization of~string clusters
Čebyševskij sbornik, Tome 20 (2019) no. 2, pp. 93-107

Voir la notice de l'article provenant de la source Math-Net.Ru

We consider the following problem of searching the median of a string set: $$ c=\mathrm{argmin}_{a\in U}\sum_{b\in D}d(a,b), $$ where $D\subset G^{*}$ is the finite set of strings over the alphabet $G$, $U\subseteq G^{*}$, $d$ is the Levenshtein edit distance. This problem has important applications, e.g., in bioinformatics in the analysis of protein sequences. However, it is known that in the general case for $U=G^{*}$ the median problem is NP-hard. Therefore, for an approximate solution, heuristics algorithms were proposed, in particular, the greedy algorithm. We give a new flexible approach based on smooth Levenshtein distance approximation $\tilde{d}$. It uses stochastic character encoding of sequences and the following formula for the edit distance: $$ d(X_{1},X_{2})=\min_{(X_{1}',X_{2}'')_{e}\subseteq X_{1}\times X_{2}} \Bigl\{\frac{1}{2}\,\|X_{1}'-X_{2}''\|_{1}+|X_{1}|-|X_{1}'|+ |X_{2}|-|X_{2}''|\Bigr\}, $$ where the minimum is taken over all subsequences $(X_{1}',X_{2}'')_{e}$ of equal length. On the one hand, stochastic coding extends the class, over which the extremum is searched. However, our main result shows that the median is the same. On the other hand, now we can use smooth optimization methods, replacing the minimum in the definition above with a smooth approximation. We give implementation details based on the gradient descent, including the recursion formulas for calculating $\tilde{d}$. Effective calculation of the approximate median allows, e.g., to use the $k$-means method for string clustering. We propose an approach to visualize such clusters based on tSNE stochastic nesting method.
Keywords: Levenshtein edit distance, string median, smooth minimum, gradient descent, $k$-means, string cluster visualization.
@article{CHEB_2019_20_2_a6,
     author = {D. V. Gorbachev and E. P. Ofitserov},
     title = {New approach to searching for string median and visualization of~string clusters},
     journal = {\v{C}eby\v{s}evskij sbornik},
     pages = {93--107},
     publisher = {mathdoc},
     volume = {20},
     number = {2},
     year = {2019},
     language = {ru},
     url = {http://geodesic.mathdoc.fr/item/CHEB_2019_20_2_a6/}
}
TY  - JOUR
AU  - D. V. Gorbachev
AU  - E. P. Ofitserov
TI  - New approach to searching for string median and visualization of~string clusters
JO  - Čebyševskij sbornik
PY  - 2019
SP  - 93
EP  - 107
VL  - 20
IS  - 2
PB  - mathdoc
UR  - http://geodesic.mathdoc.fr/item/CHEB_2019_20_2_a6/
LA  - ru
ID  - CHEB_2019_20_2_a6
ER  - 
%0 Journal Article
%A D. V. Gorbachev
%A E. P. Ofitserov
%T New approach to searching for string median and visualization of~string clusters
%J Čebyševskij sbornik
%D 2019
%P 93-107
%V 20
%N 2
%I mathdoc
%U http://geodesic.mathdoc.fr/item/CHEB_2019_20_2_a6/
%G ru
%F CHEB_2019_20_2_a6
D. V. Gorbachev; E. P. Ofitserov. New approach to searching for string median and visualization of~string clusters. Čebyševskij sbornik, Tome 20 (2019) no. 2, pp. 93-107. http://geodesic.mathdoc.fr/item/CHEB_2019_20_2_a6/