Text classification by genre based on rhythm features
Modelirovanie i analiz informacionnyh sistem, Vol. 28 (2021), no. 3, pp. 280-291.

See the article record from the source Math-Net.Ru

The article analyzes the rhythm of texts of six genres: fiction novels, advertisements, scientific articles, reviews, tweets, and political articles. The authors identified lexico-grammatical figures in the texts (anaphora, epiphora, diacope, aposiopesis, etc.) that serve as markers of text rhythm. From these figures they calculated statistical features that describe the rhythm quantitatively and structurally. The resulting text model was visualized for statistical analysis using boxplots and heat maps. The boxplots showed that almost all genres differ from each other in the overall density of rhythm features, and the heat maps showed different rhythm patterns across genres. The rhythm features were then successfully used to classify texts into six genres. The classification was carried out in two ways: a binary classification for each genre, separating that genre from the remaining ones, and a multi-class classification of the whole text corpus into six genres at once. Two text corpora, in English and in Russian, were used for the experiments. Each corpus contains 100 each of fiction novels, scientific articles, advertisements and tweets, and 50 each of reviews and political articles, i.e. a total of 500 texts. The high quality of the classification with neural networks showed that rhythm features are a good marker for most genres, especially fiction. The experiments were carried out using the ProseRhythmDetector software tool for the Russian and English languages. The text corpora contain 300 texts for each language.
Keywords: stylometry, natural language processing, rhythm features, genres, text classification.
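
To make the idea of rhythm features concrete, the following minimal Python sketch counts two of the figures named in the abstract and turns them into a density value. It is an illustration only, not the ProseRhythmDetector implementation: the function names, the naive sentence splitter, and the simplified definitions (anaphora as two adjacent sentences that start with the same word, epiphora as two adjacent sentences that end with the same word, density as figures per sentence) are all assumptions made for this example.

```python
import re


def split_sentences(text):
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def words(sentence):
    # Lowercased word tokens; \w also matches Cyrillic letters in Python 3.
    return re.findall(r"\w+", sentence.lower())


def rhythm_features(text):
    # Tokenize every sentence and drop sentences without any word tokens.
    sents = [words(s) for s in split_sentences(text)]
    sents = [s for s in sents if s]
    pairs = list(zip(sents, sents[1:]))

    # Simplified definitions (assumption): compare only adjacent sentences.
    anaphora = sum(1 for a, b in pairs if a[0] == b[0])
    epiphora = sum(1 for a, b in pairs if a[-1] == b[-1])

    n = len(sents)
    return {
        "sentences": n,
        "anaphora": anaphora,
        "epiphora": epiphora,
        # Hypothetical density measure: detected figures per sentence.
        "density": (anaphora + epiphora) / n if n else 0.0,
    }


sample = (
    "We came to win. We stayed to learn. "
    "Nobody left early. Everybody arrived early."
)
features = rhythm_features(sample)
print(features)
```

A feature dictionary of this kind, computed per text, is the sort of input that could then be compared across genres or passed to a classifier; the real tool detects more figures and uses richer linguistic analysis than this sketch.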
@article{MAIS_2021_28_3_a5,
     author = {K. V. Lagutina and N. S. Lagutina and E. I. Boychuk},
     title = {Text classification by genre based on rhythm features},
     journal = {Modelirovanie i analiz informacionnyh sistem},
     pages = {280--291},
     publisher = {mathdoc},
     volume = {28},
     number = {3},
     year = {2021},
     language = {ru},
     url = {http://geodesic.mathdoc.fr/item/MAIS_2021_28_3_a5/}
}
K. V. Lagutina; N. S. Lagutina; E. I. Boychuk. Text classification by genre based on rhythm features. Modelirovanie i analiz informacionnyh sistem, Vol. 28 (2021), no. 3, pp. 280-291. http://geodesic.mathdoc.fr/item/MAIS_2021_28_3_a5/
