Comparison of style features for the authorship verification of literary texts

K. V. Lagutina

Modelirovanie i analiz informacionnyh sistem, Vol. 28 (2021), no. 3, pp. 250-259

This article was harvested from the source Math-Net.Ru

Abstract

The article compares character-level, word-level, and rhythm features for the authorship verification of literary texts of the 19th-21st centuries. The text corpora contain fragments of novels; each fragment is about 50,000 characters long, and there are 40 fragments for each author. Twenty authors who wrote in English, Russian, and French and 8 Spanish-language authors are considered. The authors of this paper use existing algorithms to calculate low-level features, which are popular in computational linguistics, and rhythm features, which are common in literary texts. Low-level features include word n-grams, frequencies of letters and punctuation marks, average word and sentence lengths, etc. Rhythm features are based on lexico-grammatical figures: anaphora, epiphora, symploce, aposiopesis, epanalepsis, anadiplosis, diacope, epizeuxis, chiasmus, polysyndeton, and repetitive exclamatory and interrogative sentences. These features include the frequency of particular rhythm figures per 100 sentences, the number of unique words in the aspects of rhythm, and the percentages of nouns, adjectives, adverbs, and verbs in the aspects of rhythm. Authorship verification is treated as a binary classification problem: whether or not a text belongs to a particular author. AdaBoost and a neural network with an LSTM layer are used as classification algorithms. The experiments demonstrate the effectiveness of rhythm features in verifying particular authors and, on average, the superiority of combinations of feature types over single feature types. The best values of precision, recall, and F-measure for the AdaBoost classifier exceed 90% when all three types of features are combined.
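
To make the two feature families concrete, the following is a minimal sketch rather than the paper's implementation. It assumes a naive regex sentence splitter and approximates anaphora as two adjacent sentences that start with the same word; the abstract does not give the exact definitions used in the experiments. All function names (`split_sentences`, `low_level_features`, `anaphora_per_100_sentences`) are hypothetical.

```python
import re
from collections import Counter


def split_sentences(text):
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def words(sentence):
    # Lowercased runs of Unicode letters (digits and underscores excluded).
    return re.findall(r"[^\W\d_]+", sentence.lower())


def low_level_features(text):
    # Character- and word-level statistics of the kind named in the abstract.
    sentences = split_sentences(text)
    tokens = [w for s in sentences for w in words(s)]
    letters = Counter(ch for ch in text.lower() if ch.isalpha())
    total_letters = sum(letters.values()) or 1
    return {
        "avg_word_len": sum(map(len, tokens)) / max(len(tokens), 1),
        "avg_sentence_len": len(tokens) / max(len(sentences), 1),
        "comma_per_char": text.count(",") / max(len(text), 1),
        "letter_freq": {ch: n / total_letters for ch, n in letters.items()},
    }


def anaphora_per_100_sentences(text):
    # Simplified anaphora: adjacent sentences sharing their first word.
    sentences = [words(s) for s in split_sentences(text)]
    hits = sum(
        1
        for prev, cur in zip(sentences, sentences[1:])
        if prev and cur and prev[0] == cur[0]
    )
    return 100.0 * hits / max(len(sentences), 1)


sample = "We came. We saw. We won. Then it rained."
features = low_level_features(sample)
rate = anaphora_per_100_sentences(sample)
```

In a verification setting, vectors of this kind would be computed per text fragment and passed to a binary classifier that decides whether the fragment belongs to the candidate author.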

Keywords: stylometry, natural language processing, style features, rhythm features, authorship verification.

@article{MAIS_2021_28_3_a3,
     author = {K. V. Lagutina},
     title = {Comparison of style features for the authorship verification of literary texts},
     journal = {Modelirovanie i analiz informacionnyh sistem},
     pages = {250--259},
     year = {2021},
     volume = {28},
     number = {3},
     language = {en},
     url = {http://geodesic.mathdoc.fr/item/MAIS_2021_28_3_a3/}
}

TY  - JOUR
AU  - K. V. Lagutina
TI  - Comparison of style features for the authorship verification of literary texts
JO  - Modelirovanie i analiz informacionnyh sistem
PY  - 2021
SP  - 250
EP  - 259
VL  - 28
IS  - 3
UR  - http://geodesic.mathdoc.fr/item/MAIS_2021_28_3_a3/
LA  - en
ID  - MAIS_2021_28_3_a3
ER  -

%0 Journal Article
%A K. V. Lagutina
%T Comparison of style features for the authorship verification of literary texts
%J Modelirovanie i analiz informacionnyh sistem
%D 2021
%P 250-259
%V 28
%N 3
%U http://geodesic.mathdoc.fr/item/MAIS_2021_28_3_a3/
%G en
%F MAIS_2021_28_3_a3

K. V. Lagutina. Comparison of style features for the authorship verification of literary texts. Modelirovanie i analiz informacionnyh sistem, Vol. 28 (2021), no. 3, pp. 250-259. http://geodesic.mathdoc.fr/item/MAIS_2021_28_3_a3/

