Automated search and analysis of the stylometric features that describe the style of the prose 19th-21st centuries
Modelirovanie i analiz informacionnyh sistem, Vol. 27 (2020) no. 3, pp. 330-343.

See the article record from the source Math-Net.Ru

The article compares stylometric features of several levels that serve as markers of prose style, and analyzes stylistic changes in Russian and British prose of the 19th–21st centuries. The features include low-level ones based on words and characters and high-level ones based on rhythm. Together they model the style of a text and indicate the period in which it was written. All features are computed fully automatically, which makes it possible to run large-scale experiments on long literary works and speeds up a linguist's work. The ProseRhythmDetector program computes the stylometric features, including those derived from the search for rhythmic figures. Its output represents every text by the same set of features at three levels: characters, words, and rhythm. Texts are grouped by decade, and the mean value of each stylometric feature is computed per decade. The resulting decade models are compared using standard similarity metrics, and the comparison results are visualized as heat maps and dendrograms. Experiments with two corpora, one of Russian and one of British texts, show trends in style change over the 19th–21st centuries that are common to both corpora, for example a decrease in the number of rhythmic figures per sentence, as well as trends specific to each language, for example in the dynamics of word and sentence lengths. Stylometric features of all levels reveal similarity in the style of texts published within the same century. Moreover, features of the three levels taken together demonstrate the uniqueness of each decade better than features of any single level. The study shows the importance of stylometric features as style markers of different eras and makes it possible to identify trends in style over several centuries.
Keywords: text rhythm, rhythm analysis, natural language processing, stylometry, rhythm figures, automation
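
The decade-level comparison described in the abstract can be illustrated with a minimal sketch. It is not the ProseRhythmDetector implementation: the two features used here (mean word length and mean sentence length) stand in for the full character, word, and rhythm feature set, cosine similarity is assumed as one example of a "standard similarity metric", and all function names are hypothetical.

```python
import math
import re
from collections import defaultdict


def text_features(text):
    """Toy feature vector: [mean word length, mean sentence length in words].

    Stand-in for the real multi-level features; rhythm figures are not modeled.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text)
    if not sentences or not words:
        return [0.0, 0.0]
    mean_word_len = sum(len(w) for w in words) / len(words)
    mean_sent_len = len(words) / len(sentences)
    return [mean_word_len, mean_sent_len]


def decade_models(corpus):
    """Group (year, text) pairs by decade and average each feature per decade."""
    groups = defaultdict(list)
    for year, text in corpus:
        groups[year // 10 * 10].append(text_features(text))
    return {
        decade: [sum(col) / len(col) for col in zip(*vectors)]
        for decade, vectors in sorted(groups.items())
    }


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similarity_matrix(models):
    """Pairwise decade similarities; the matrix is the input for a heat map."""
    decades = list(models)
    matrix = [
        [cosine_similarity(models[r], models[c]) for c in decades]
        for r in decades
    ]
    return decades, matrix
```

Calling `similarity_matrix(decade_models(corpus))` on a list of `(year, text)` pairs yields the decade labels and a square matrix. Such a matrix could then be passed to a plotting or clustering library to draw heat maps and dendrograms of the kind the abstract mentions; in practice the features would likely need normalization first, since raw values on different scales can dominate the cosine measure.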
@article{MAIS_2020_27_3_a3,
     author = {K. V. Lagutina and A. M. Manakhova},
     title = {Automated search and analysis of the stylometric features that describe the style of the prose 19th-21st centuries},
     journal = {Modelirovanie i analiz informacionnyh sistem},
     pages = {330--343},
     publisher = {mathdoc},
     volume = {27},
     number = {3},
     year = {2020},
     language = {ru},
     url = {http://geodesic.mathdoc.fr/item/MAIS_2020_27_3_a3/}
}
K. V. Lagutina; A. M. Manakhova. Automated search and analysis of the stylometric features that describe the style of the prose 19th-21st centuries. Modelirovanie i analiz informacionnyh sistem, Vol. 27 (2020) no. 3, pp. 330-343. http://geodesic.mathdoc.fr/item/MAIS_2020_27_3_a3/
