Comparative analysis of four methods for identifying letters of texts

Yu. A. Kotov


Informacionnye tehnologii i vyčislitelnye sistemy, no. 3 (2019), pp. 41-56

This article was harvested from the source Math-Net.Ru


Abstract

The article presents a comparison of four known frequency-based methods for identifying the letters of a text, a task that arises in applied cryptanalysis, steganography, and the general text analysis problems known in computer science as text mining. To obtain a complete and uniform characterization of the methods, an evaluation procedure is proposed: three identification errors are measured and combined into an integral characteristic called the goodness of the method. Using this procedure, one unigram method and three bigram methods for identifying the letters of texts were compared experimentally and analyzed qualitatively. The comparison was carried out on representative samples of fragments of Russian texts. The qualitative and quantitative features of the methods, the limits of their effective use, and their dependence on the type and volume of the processed text are determined. It is also shown that, for frequency methods applied to Russian-language texts, an important threshold is a text length of approximately 4,000 characters. This volume is sufficient to identify the alphabet characters of a Russian-language text by frequency with minimal error and, in some cases, to obtain an exact solution. With texts of this length or longer, the frequency methods for identifying alphabet characters and the proposed error estimates can also be used to quantify certain stylistic features of the text.

Keywords: text, alphabet character, cipher, text analysis, unigram, bigram, identification, one-to-one substitution
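
As a rough illustration of what unigram frequency identification under a one-to-one substitution can look like, the following minimal sketch ranks the letters of a ciphertext and of a reference text by frequency and pairs them rank by rank. It is a generic textbook approach, not the specific method evaluated in the article. The function names, the tie-breaking rule, and the single error measure (the share of misidentified letters) are assumptions made for this example; the article's three identification errors and its goodness characteristic are not specified in the abstract and are not reproduced here.

```python
from collections import Counter


def rank_by_frequency(text, alphabet):
    """Return the alphabet's letters ordered from most to least frequent in text."""
    counts = Counter(ch for ch in text if ch in alphabet)
    # Ties (including letters that never occur) are broken by alphabet order;
    # this tie-breaking rule is an assumption of the sketch.
    return sorted(alphabet, key=lambda ch: (-counts[ch], alphabet.index(ch)))


def identify_letters(ciphertext, reference_text, alphabet):
    """Map each cipher letter to the reference letter with the same frequency rank."""
    cipher_rank = rank_by_frequency(ciphertext, alphabet)
    reference_rank = rank_by_frequency(reference_text, alphabet)
    return dict(zip(cipher_rank, reference_rank))


def identification_error(guess, true_key, alphabet):
    """Share of alphabet letters mapped to the wrong plaintext letter (hypothetical measure)."""
    wrong = sum(1 for ch in alphabet if guess[ch] != true_key[ch])
    return wrong / len(alphabet)


# Toy usage with a three-letter alphabet (illustrative data only).
alphabet = "abc"
true_key = {"b": "a", "c": "b", "a": "c"}   # cipher letter -> plaintext letter
ciphertext = "bbbcca"                       # encryption of "aaabbc"
guess = identify_letters(ciphertext, "aaabbc", alphabet)
error = identification_error(guess, true_key, alphabet)
```

On short texts the frequency ranks of rare letters are unstable, so a rank-for-rank pairing of this kind tends to misidentify them; this is consistent with the abstract's observation that text volume determines how well frequency methods work.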

@article{ITVS_2019_3_a3,
     author = {Yu. A. Kotov},
     title = {Comparative analysis of four methods for identifying letters of texts},
     journal = {Informacionnye tehnologii i vy\v{c}islitelnye sistemy},
     pages = {41--56},
     year = {2019},
     number = {3},
     language = {ru},
     url = {http://geodesic.mathdoc.fr/item/ITVS_2019_3_a3/}
}

TY  - JOUR
AU  - Yu. A. Kotov
TI  - Comparative analysis of four methods for identifying letters of texts
JO  - Informacionnye tehnologii i vyčislitelnye sistemy
PY  - 2019
SP  - 41
EP  - 56
IS  - 3
UR  - http://geodesic.mathdoc.fr/item/ITVS_2019_3_a3/
LA  - ru
ID  - ITVS_2019_3_a3
ER  -

%0 Journal Article
%A Yu. A. Kotov
%T Comparative analysis of four methods for identifying letters of texts
%J Informacionnye tehnologii i vyčislitelnye sistemy
%D 2019
%P 41-56
%N 3
%U http://geodesic.mathdoc.fr/item/ITVS_2019_3_a3/
%G ru
%F ITVS_2019_3_a3

Yu. A. Kotov. Comparative analysis of four methods for identifying letters of texts. Informacionnye tehnologii i vyčislitelnye sistemy, no. 3 (2019), pp. 41-56. http://geodesic.mathdoc.fr/item/ITVS_2019_3_a3/
