Number of overlaps in patterns
Matematičeskaâ biologiâ i bioinformatika, Tome 11 (2016) no. 1, pp. 14-23.

Voir la notice de l'article provenant de la source Math-Net.Ru

The aim of the paper is to estimate the number of overlaps in the given pattern. The pattern is a set of words of same length $m$ in an alphabet $A$. We present theoretical and experimental bounds for overlaps number in two types of patterns. Firstly, we considered random patterns which relate to uniform probability model, i.e. all letters in the alphabet and, correspondently, all words of same length are equiprobable. We proved that the average number of overlaps $P$ for random patterns consisting of $n$ words of length $m$ linearly depends on pattern size $n$ and is independent of length of pattern words. In performed computer experiments the ratio $P/n$ ranged from $0.33$ till $1.06$; the theoretical evaluations of the ratio for the patterns do not exceed $1.67$. The secondly, we studied the patterns described by position weight matrices (PWM) from the data base HOCOMOCO and various cut-offs. For such patterns the ratio $P/n$ in experiments ranged from $0.004$ till $1$, for most of the patterns it is smaller then $0.1$.
@article{MBB_2016_11_1_a1,
     author = {E. I. Furletova and M. A. Roytberg},
     title = {Number of overlaps in patterns},
     journal = {Matemati\v{c}eska\^a biologi\^a i bioinformatika},
     pages = {14--23},
     publisher = {mathdoc},
     volume = {11},
     number = {1},
     year = {2016},
     language = {ru},
     url = {http://geodesic.mathdoc.fr/item/MBB_2016_11_1_a1/}
}
TY  - JOUR
AU  - E. I. Furletova
AU  - M. A. Roytberg
TI  - Number of overlaps in patterns
JO  - Matematičeskaâ biologiâ i bioinformatika
PY  - 2016
SP  - 14
EP  - 23
VL  - 11
IS  - 1
PB  - mathdoc
UR  - http://geodesic.mathdoc.fr/item/MBB_2016_11_1_a1/
LA  - ru
ID  - MBB_2016_11_1_a1
ER  - 
%0 Journal Article
%A E. I. Furletova
%A M. A. Roytberg
%T Number of overlaps in patterns
%J Matematičeskaâ biologiâ i bioinformatika
%D 2016
%P 14-23
%V 11
%N 1
%I mathdoc
%U http://geodesic.mathdoc.fr/item/MBB_2016_11_1_a1/
%G ru
%F MBB_2016_11_1_a1
E. I. Furletova; M. A. Roytberg. Number of overlaps in patterns. Matematičeskaâ biologiâ i bioinformatika, Tome 11 (2016) no. 1, pp. 14-23. http://geodesic.mathdoc.fr/item/MBB_2016_11_1_a1/

[1] Boyer R. S., Moore J. S., “A fast string searching algorithm”, Communications ACM, 20:10 (1977), 762–772 | DOI | Zbl

[2] Knuth D., Morris J. H., Pratt J. V., “Fast pattern matching in strings”, SIAM Journal on Computing, 6:2 (1977), 323–350 | DOI | MR | Zbl

[3] Aho A. V., Corasick M. J., “Efficient string matching: An aid to bibliographic search”, Communications of the ACM, 18:6 (1975), 333–340 | DOI | MR | Zbl

[4] Crochemore M., Hancart C., Lecroq T., Algorithms on strings, Cambridge University Press, New York, 2007, 353 pp. | MR | Zbl

[5] Lothair M., Combinatorics on Words, Cambridge University Press, Cambridge, 1997, 260 pp. | MR | Zbl

[6] Duval J. P., Lecroq T., Lefebvre A., “Border array on bounded alphabet”, Journal of Automata, Languages and Combinatorics, 10:1 (2005), 51–60 | MR | Zbl

[7] Stormo G. D., “DNA binding sites: representation and discovery”, Bioinformatics, 16:1 (2000), 16–23 | DOI

[8] Durbin R., Eddy S., Krogh A., Mitchison G., Biological sequences analysis: Probabilistic models of proteins and nucleic acids, Cambridge University Press, Cambridge, 1998

[9] Kulakovskiy I., Medvedeva Y. A., Shaefer U., Kasianov A. S., Vorontsov I. E., Bajic V. B., Makeev V. J., “HOCOMOCO: A comprehensive collection of human transcription factor binding sites models”, Nucleic Acids Research, 41 (2013), 195–202 | DOI

[10] Régnier M., Furletova E., Yakovlev V., Roytberg M., “Analysis of pattern overlaps and exact computation of P-values of pattern occurrences numbers: case of Hidden Markov Models”, Algorithms for Molecular Biology, 9:1 (2014) | DOI

[11] Lothaire M., “Statistics on Words with Applications to Biological Sequences”, Applied combinatorics on words, Encyclopedia of Mathematics and its Applications, Cambridge University Press, Cambridge, 2004, 610 | MR