Classification of Russian texts by genres based on modern embeddings and rhythm
Modelirovanie i analiz informacionnyh sistem, Vol. 29 (2022), no. 4, pp. 334-347.

See the article record from the source Math-Net.Ru

The article investigates modern vector text models for genre classification of Russian-language texts. The models are ELMo embeddings, a pre-trained BERT language model, and a set of numerical rhythm features derived from lexico-grammatical features. Experiments were carried out on a corpus of 10,000 texts in five genres: novels, scientific articles, reviews, posts from the social network Vkontakte, and news from OpenCorpora. Visualization and statistical analysis of the rhythm features identified the genres most diverse in rhythm (novels and reviews) and the least diverse (scientific articles). These same genres were also classified best by the rhythm features combined with an LSTM neural network classifier. Clustering and classification of texts by genre with ELMo and BERT embeddings separated the genres from one another with few errors. The multiclass F-score reached 99%. The study confirms the effectiveness of modern embeddings in computational linguistics tasks and highlights the advantages and limitations of the rhythm feature set for genre classification.
Keywords: stylometry, natural language processing, rhythm features, ELMo, genres, text classification, BERT.
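
The abstract reports a multiclass F-score without stating how per-genre scores are averaged. As a minimal sketch, assuming unweighted (macro) averaging over the five genres, the metric could be computed as follows. The function name `macro_f1`, the genre label strings, and the toy predictions are illustrative assumptions, not taken from the article.

```python
from collections import Counter

# Hypothetical label names for the five genres listed in the abstract.
GENRES = ["novel", "scientific article", "review", "social media post", "news"]


def macro_f1(y_true, y_pred, labels):
    """Return (per-label F1 dict, unweighted mean of the per-label F1 values).

    Macro averaging is an assumption; the abstract does not name the scheme.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for true_label, pred_label in zip(y_true, y_pred):
        if true_label == pred_label:
            tp[true_label] += 1
        else:
            fp[pred_label] += 1  # predicted genre gains a false positive
            fn[true_label] += 1  # true genre gains a false negative
    per_label = {}
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the label never occurs.
        denom = 2 * tp[label] + fp[label] + fn[label]
        per_label[label] = 2 * tp[label] / denom if denom else 0.0
    return per_label, sum(per_label.values()) / len(labels)


if __name__ == "__main__":
    # Toy data only; these are not results from the article.
    y_true = ["novel", "news", "review", "news", "scientific article"]
    y_pred = ["novel", "news", "novel", "news", "scientific article"]
    per_genre, score = macro_f1(y_true, y_pred, GENRES)
    print(per_genre)
    print(f"macro F1 = {score:.3f}")
```

With macro averaging every genre contributes equally to the final score, so a genre that is classified poorly lowers the result regardless of how many texts it contains.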
@article{MAIS_2022_29_4_a2,
     author = {K. V. Lagutina},
     title = {Classification of {R}ussian texts by genres based on modern embeddings and rhythm},
     journal = {Modelirovanie i analiz informacionnyh sistem},
     pages = {334--347},
     publisher = {mathdoc},
     volume = {29},
     number = {4},
     year = {2022},
     language = {ru},
     url = {http://geodesic.mathdoc.fr/item/MAIS_2022_29_4_a2/}
}
K. V. Lagutina. Classification of Russian texts by genres based on modern embeddings and rhythm. Modelirovanie i analiz informacionnyh sistem, Vol. 29 (2022), no. 4, pp. 334-347. http://geodesic.mathdoc.fr/item/MAIS_2022_29_4_a2/
