Entropy approach to the construction of a measure of word symbolic diverseness and its application to clustering of plant genomes
Matematičeskaâ biologiâ i bioinformatika, Vol. 11 (2016) no. 1, pp. 114-126.

See the article record from the source Math-Net.Ru

An approach to information analysis is considered for the case where the information is represented by words of finite length over a finite alphabet. A method is proposed for constructing a measure of the symbolic diverseness of words, based on peak characteristics of a shift entropy function. The shift entropy function is formally defined using a unit translation operator and the entropy of discrete distributions. A model example is presented, together with some results of applying the proposed measure to the clustering of plant families through analysis of the genomes of their representatives.
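This record does not reproduce the paper's formal definition, so the following Python sketch is only one plausible reading of the two ingredients the abstract names: a unit translation (shift) operator on a word and the Shannon entropy of a discrete distribution. The function names `unit_shift`, `shannon_entropy`, and `shift_entropy`, the choice of a cyclic shift, and the choice to take the entropy of symbol pairs are all assumptions made for illustration and are not the authors' definition.

```python
from collections import Counter
from math import log2


def shannon_entropy(counts):
    """Shannon entropy (in bits) of the distribution given by raw counts."""
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts)


def unit_shift(word):
    """Cyclic unit translation (assumed): move the first symbol to the end."""
    return word[1:] + word[:1]


def shift_entropy(word, k):
    """Hypothetical shift entropy at shift k.

    Assumption: the value is the entropy of the empirical distribution of
    pairs (word[i], shifted[i]), where `shifted` is `word` after k unit shifts.
    """
    shifted = word
    for _ in range(k):
        shifted = unit_shift(shifted)
    pair_counts = Counter(zip(word, shifted))
    return shannon_entropy(pair_counts.values())


if __name__ == "__main__":
    word = "ACGTACGTAACC"
    for k in range(1, len(word)):
        print(k, round(shift_entropy(word, k), 3))
```

Under these assumptions, evaluating `shift_entropy` over all shifts `k` gives a profile of the word whose extreme values could then be inspected; how the paper turns such a profile into its diverseness measure is not stated in this record.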
@article{MBB_2016_11_1_a7,
     author = {Yu. G. Smetanin and M. V. Ulyanov and A. S. Pestova},
     title = {Entropy approach to the construction of a measure of word symbolic diverseness and its application to clustering of plant genomes},
     journal = {Matemati\v{c}eska\^a biologi\^a i bioinformatika},
     pages = {114--126},
     publisher = {mathdoc},
     volume = {11},
     number = {1},
     year = {2016},
     language = {ru},
     url = {http://geodesic.mathdoc.fr/item/MBB_2016_11_1_a7/}
}
Yu. G. Smetanin; M. V. Ulyanov; A. S. Pestova. Entropy approach to the construction of a measure of word symbolic diverseness and its application to clustering of plant genomes. Matematičeskaâ biologiâ i bioinformatika, Vol. 11 (2016) no. 1, pp. 114-126. http://geodesic.mathdoc.fr/item/MBB_2016_11_1_a7/

[1] Lothaire M., Algebraic Combinatorics of Words, Cambridge Univ. Press, Cambridge (UK), 2002, 455 pp.

[2] Lind D., Marcus B., An introduction to symbolic dynamics and coding, Cambridge Univ. Press, Cambridge (UK), 1995, 495 pp.

[3] Shannon C. E., “A mathematical theory of communication”, Bell Syst. Techn. Journ., XXVII:3 (1948), 379–423

[4] Shannon C. E., “A mathematical theory of communication”, Bell Syst. Techn. Journ., XXVII:4 (1948), 623–656

[5] Kolmogorov A. N., “Obschaya teoriya dinamicheskikh sistem i klassicheskaya mekhanika”, Mezhdunarodnyi matematicheskii kongress v Amsterdame 1954 g: obzornye doklady, ed. Fomin S. V., Izd-vo AN SSSR, M., 1961, 187–208

[6] Khinchin A. Ya., “Ponyatie entropii v teorii veroyatnostei”, Uspekhi matematicheskikh nauk, 8:3(55) (1953), 3–20

[7] Martin N., Inglend Dzh., Matematicheskaya teoriya entropii, Mir, M., 1988, 350 pp.

[8] Smetanin Y. G., Ulyanov M. V., “Reconstruction of a Word from a Finite Set of its Subwords under the Unit Shift Hypothesis. I. Reconstruction without Forbidden Words”, Cybernetics and Systems Analysis, 50:1 (2014), 148–156

[9] Wootton J. C., Federhen S., “Analysis of compositionally biased regions in sequence databases”, Methods Enzymol., 266 (1996), 554–571

[10] Gusev V. D., Kulichkov V. A., Chupakhina O. M. Y., “Complexity analysis of genomes. I. Complexity and classification methods of detected structural regularities”, Mol. Biol. (Mosk)., 25:3 (1991), 825–834

[11] Gusev V. D., Kulichkov V. A., Chupakhina O. M., “The Lempel–Ziv complexity and local structure analysis of genomes”, Biosystems, 30:1–3 (1993), 183–200

[12] Kislyuk O. S., Borovina T. A., Nazipova N. N., “Estimation of Redundancy of Genetic Texts by the High Frequency Component of the $l$-Gram Graph”, Biophysics, 44:4 (1999), 621–630

[13] Troyanskaya O. G., Arbell O., Koren Y., Landau G. M., Bolshoy A., “Sequence complexity profiles of prokaryotic genomic sequences: a fast algorithm for calculating linguistic complexity”, Bioinformatics, 18:5 (2002), 679–688

[14] Orlov Yu. L., Analiz regulyatornykh genomnykh posledovatelnostei s pomoschyu kompyuternykh metodov otsenok slozhnosti geneticheskikh tekstov, Diss. na soiskanie uch. st. kand. biol. nauk, Novosibirsk, 2004, 148 pp.

[15] Rudakov K. V., Torshin I. Yu., “Ob otbore informativnykh znachenii priznakov na baze kriteriev razreshimosti v zadache raspoznavaniya vtorichnoi struktury belka”, DAN, 441:1 (2011), 24–28

[16] Smetanin Yu. G., Ulyanov M. V., “Podkhod k opredeleniyu kharakteristik kolmogorovskoi slozhnosti vremennykh ryadov na osnove simvolnykh opisanii”, Biznes-informatika, 2013, no. 2(24), 49–54

[17] Smetanin Yu. G., Ulyanov M. V., “Mera simvolnogo raznoobraziya: podkhod kombinatoriki slov k opredeleniyu obobschennykh kharakteristik vremennykh ryadov”, Biznes-informatika, 2014, no. 3(29), 40–48

[18] Kormen T., Leizerson Ch., Rivest R., Shtain K., Algoritmy: postroenie i analiz, Izdatelskii dom «Vilyams», M., 2005, 1296 pp.

[19] GenBank, (accessed: 20.03.2016) http://www.ncbi.nlm.nih.gov/genbank/

[20] European Nucleotide Archive, (accessed: 20.03.2016) http://www.ebi.ac.uk/ena

[21] DNA Data Bank of Japan, (accessed: 20.03.2016) http://www.ddbj.nig.ac.jp/