Algorithm of the correction of bigram method for the problem of the text author identification

M. Yu. Voronina; A. A. Kislitsyn; Yu. N. Orlov

Matematičeskoe modelirovanie, Vol. 34 (2022) no. 9, pp. 3-20

This article was harvested from the source Math-Net.Ru

Abstract

The paper proposes a model for recognizing the authors of literary texts based on the proximity of an individual text to the author's standard. The standard is the empirical frequency distribution of two-letter combinations (bigrams), constructed from all reliably attributed works of the author. Proximity is measured in the L1 norm. An unknown text is attributed to the author whose standard is closest to it. Identification relies on a library of authors, each represented by enough works to define the corresponding bigram standard. The method was tested on the authors of the library. The analyzed corpus contained 1783 texts by 100 authors, and the recognition error of the best method was 0.12. Importantly, after the erroneously recognized texts were excluded, a library of 88 authors and 1450 texts remained, each of which was identified correctly. The problem under study is to estimate the probability that the standard of the tested text's author is absent from the library. To solve it, the paper analyzes how the probability of erroneous identification depends on the length of the text. For the subgroup of texts that are identified without error, the empirical probability of correctly recognizing a text fragment decreases as the fragment gets shorter, but still exceeds 0.5 when the text is split into as many as 10 parts. With smaller fragments, some are identified incorrectly. If the correct standard is excluded from consideration, the text is assigned to the second-closest standard, but this assignment is unstable: the identification of fragments already becomes ambiguous when the text is cut into 4 fragments. Thus, the stability of the identification of the author across text fragments can be proposed as a criterion for the correctness of the method.
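
The procedure described in the abstract (bigram standards, L1 proximity, nearest-standard attribution, and a fragment stability check) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the choice to keep only alphabetic characters (so that bigrams run across word boundaries), and the equal-length splitting of a text into fragments are all assumptions made for illustration.

```python
from collections import Counter


def bigram_counts(text):
    """Count adjacent letter pairs; non-letters are dropped (an assumption)."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    return Counter(a + b for a, b in zip(letters, letters[1:]))


def normalize(counts):
    """Turn raw counts into an empirical frequency distribution."""
    total = sum(counts.values())
    return {bg: n / total for bg, n in counts.items()} if total else {}


def build_standard(texts):
    """Author standard: bigram distribution pooled over all of the author's texts."""
    pooled = Counter()
    for text in texts:
        pooled.update(bigram_counts(text))
    return normalize(pooled)


def l1_distance(p, q):
    """L1 distance between two bigram distributions stored as dicts."""
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def identify(text, standards):
    """Return the author whose standard is closest to the text in L1."""
    dist = normalize(bigram_counts(text))
    return min(standards, key=lambda author: l1_distance(dist, standards[author]))


def fragment_votes(text, standards, parts):
    """Split the text into `parts` equal-length fragments (hypothetical
    splitting rule) and identify each fragment separately."""
    if parts < 1:
        raise ValueError("parts must be at least 1")
    size = len(text) // parts
    fragments = [text[i * size:(i + 1) * size] for i in range(parts)]
    return [identify(fragment, standards) for fragment in fragments]


def is_stable(text, standards, parts):
    """True if every fragment is attributed to the same author."""
    return len(set(fragment_votes(text, standards, parts))) == 1
```

With `standards` built as a dict mapping each author name to `build_standard(texts_of_that_author)`, calling `is_stable(text, standards, parts)` for increasing `parts` gives a rough view of how attribution stability depends on fragment length, in the spirit of the criterion proposed in the abstract.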

Keywords: text, author, correction of error probability, bigram distribution, fragment identification.

@article{MM_2022_34_9_a0,
     author = {M. Yu. Voronina and A. A. Kislitsyn and Yu. N. Orlov},
     title = {Algorithm of the correction of bigram method for the problem of the text author identification},
     journal = {Matemati\v{c}eskoe modelirovanie},
     pages = {3--20},
     year = {2022},
     volume = {34},
     number = {9},
     language = {ru},
     url = {http://geodesic.mathdoc.fr/item/MM_2022_34_9_a0/}
}

TY  - JOUR
AU  - M. Yu. Voronina
AU  - A. A. Kislitsyn
AU  - Yu. N. Orlov
TI  - Algorithm of the correction of bigram method for the problem of the text author identification
JO  - Matematičeskoe modelirovanie
PY  - 2022
SP  - 3
EP  - 20
VL  - 34
IS  - 9
UR  - http://geodesic.mathdoc.fr/item/MM_2022_34_9_a0/
LA  - ru
ID  - MM_2022_34_9_a0
ER  -

%0 Journal Article
%A M. Yu. Voronina
%A A. A. Kislitsyn
%A Yu. N. Orlov
%T Algorithm of the correction of bigram method for the problem of the text author identification
%J Matematičeskoe modelirovanie
%D 2022
%P 3-20
%V 34
%N 9
%U http://geodesic.mathdoc.fr/item/MM_2022_34_9_a0/
%G ru
%F MM_2022_34_9_a0

M. Yu. Voronina; A. A. Kislitsyn; Yu. N. Orlov. Algorithm of the correction of bigram method for the problem of the text author identification. Matematičeskoe modelirovanie, Vol. 34 (2022) no. 9, pp. 3-20. http://geodesic.mathdoc.fr/item/MM_2022_34_9_a0/

