New approach to searching for string median and visualization of~string clusters
Čebyševskij sbornik, Tome 20 (2019) no. 2, pp. 93-107
Voir la notice de l'article provenant de la source Math-Net.Ru
We consider the following problem of searching the median of a string set:
$$
c=\mathrm{argmin}_{a\in U}\sum_{b\in D}d(a,b),
$$
where $D\subset G^{*}$ is the finite set of strings over the alphabet $G$,
$U\subseteq G^{*}$, $d$ is the Levenshtein edit distance. This problem has
important applications, e.g., in bioinformatics in the analysis of protein
sequences. However, it is known that in the general case for $U=G^{*}$ the
median problem is NP-hard. Therefore, for an approximate solution, heuristics
algorithms were proposed, in particular, the greedy algorithm. We give a new
flexible approach based on smooth Levenshtein distance approximation
$\tilde{d}$. It uses stochastic character encoding of sequences and the
following formula for the edit distance:
$$
d(X_{1},X_{2})=\min_{(X_{1}',X_{2}'')_{e}\subseteq X_{1}\times X_{2}}
\Bigl\{\frac{1}{2}\,\|X_{1}'-X_{2}''\|_{1}+|X_{1}|-|X_{1}'|+
|X_{2}|-|X_{2}''|\Bigr\},
$$
where the minimum is taken over all subsequences $(X_{1}',X_{2}'')_{e}$ of
equal length. On the one hand, stochastic coding extends the class, over which
the extremum is searched. However, our main result shows that the median is the
same. On the other hand, now we can use smooth optimization methods, replacing
the minimum in the definition above with a smooth approximation. We give
implementation details based on the gradient descent, including the recursion
formulas for calculating $\tilde{d}$. Effective calculation of the approximate
median allows, e.g., to use the $k$-means method for string clustering. We
propose an approach to visualize such clusters based on tSNE stochastic nesting
method.
Keywords:
Levenshtein edit distance, string median, smooth minimum, gradient descent, $k$-means, string cluster visualization.
@article{CHEB_2019_20_2_a6,
author = {D. V. Gorbachev and E. P. Ofitserov},
title = {New approach to searching for string median and visualization of~string clusters},
journal = {\v{C}eby\v{s}evskij sbornik},
pages = {93--107},
publisher = {mathdoc},
volume = {20},
number = {2},
year = {2019},
language = {ru},
url = {http://geodesic.mathdoc.fr/item/CHEB_2019_20_2_a6/}
}
TY - JOUR AU - D. V. Gorbachev AU - E. P. Ofitserov TI - New approach to searching for string median and visualization of~string clusters JO - Čebyševskij sbornik PY - 2019 SP - 93 EP - 107 VL - 20 IS - 2 PB - mathdoc UR - http://geodesic.mathdoc.fr/item/CHEB_2019_20_2_a6/ LA - ru ID - CHEB_2019_20_2_a6 ER -
D. V. Gorbachev; E. P. Ofitserov. New approach to searching for string median and visualization of~string clusters. Čebyševskij sbornik, Tome 20 (2019) no. 2, pp. 93-107. http://geodesic.mathdoc.fr/item/CHEB_2019_20_2_a6/