Keywords, morpheme parsing and syntactic trees: features for text complexity assessment

D. A. Morozov; I. A. Smal; T. A. Garipov; A. V. Glazkova

Modelirovanie i analiz informacionnyh sistem, Volume 31 (2024) no. 2, pp. 206-220

See the article record from the source Math-Net.Ru

Abstract

Text complexity assessment is an applied problem of current interest, with potential applications in drafting legal documents, editing textbooks, and selecting books for extracurricular reading. Methods for building a feature vector for automatic text complexity assessment are quite diverse. Early approaches relied on easily computed quantities, such as the average sentence length or the average number of syllables per word. As natural language processing algorithms have developed, the space of usable features has expanded. In this work, we examined three groups of features: 1) automatically generated keywords, 2) characteristics of the morphemic parsing of words, and 3) information about the diversity, branching, and depth of syntactic trees. The RuTermExtract algorithm was used to generate keywords, a convolutional neural network model to generate morphemic parses, and the Stanza model trained on the SynTagRus corpus to generate syntactic trees. We compared the features using four machine learning algorithms and four annotated Russian-language text corpora. The corpora differ in both domain and annotation paradigm, so the results reflect the real relationship between the features and text complexity more objectively. On average, keywords performed worse than topic markers obtained with latent Dirichlet allocation. In most cases, morphemic characteristics were more effective than previously described methods for assessing the lexical complexity of a text, namely word frequency and the occurrence of word-formation patterns. In most cases, an extensive set of syntactic features improved the quality of neural network models compared with the previously described feature set.

Keywords: text complexity, keyword generation, morpheme parsing generation, syntax trees.
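
The branching and depth features mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `tree_features`, the feature names, and the input format (a list of 1-based head indices with 0 marking the root, in the style of CoNLL-U dependency output) are assumptions made for illustration only.

```python
from collections import defaultdict


def tree_features(heads):
    """Compute simple shape features of one dependency tree.

    heads[i] is the 1-based index of the head of token i + 1;
    the value 0 marks the root token (hypothetical input format).
    """
    # Map each head index to the list of its dependents.
    children = defaultdict(list)
    for token_id, head in enumerate(heads, start=1):
        children[head].append(token_id)

    def depth(node):
        # Depth counted in nodes: a leaf has depth 1.
        return 1 + max((depth(child) for child in children[node]), default=0)

    root = children[0][0]
    # Branching of a token = number of its direct dependents.
    branching = [len(children[t]) for t in range(1, len(heads) + 1)]
    return {
        "depth": depth(root),
        "max_branching": max(branching),
        "mean_branching": sum(branching) / len(branching),
    }


# "The cat sat on the mat": The->cat, cat->sat, sat->ROOT,
# on->mat, the->mat, mat->sat
features = tree_features([2, 3, 0, 6, 6, 3])
print(features)
```

For the example sentence the sketch gives a depth of 3 and a maximum branching of 2. Per-sentence values like these would then typically be aggregated over a text (for example, averaged) before being passed to a classifier; the exact feature set used in the article is not specified in this record.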

@article{MAIS_2024_31_2_a6,
     author = {D. A. Morozov and I. A. Smal and T. A. Garipov and A. V. Glazkova},
     title = {Keywords, morpheme parsing and syntactic trees: features for text complexity assessment},
     journal = {Modelirovanie i analiz informacionnyh sistem},
     pages = {206--220},
     publisher = {mathdoc},
     volume = {31},
     number = {2},
     year = {2024},
     language = {ru},
     url = {http://geodesic.mathdoc.fr/item/MAIS_2024_31_2_a6/}
}

TY  - JOUR
AU  - D. A. Morozov
AU  - I. A. Smal
AU  - T. A. Garipov
AU  - A. V. Glazkova
TI  - Keywords, morpheme parsing and syntactic trees: features for text complexity assessment
JO  - Modelirovanie i analiz informacionnyh sistem
PY  - 2024
SP  - 206
EP  - 220
VL  - 31
IS  - 2
PB  - mathdoc
UR  - http://geodesic.mathdoc.fr/item/MAIS_2024_31_2_a6/
LA  - ru
ID  - MAIS_2024_31_2_a6
ER  -

%0 Journal Article
%A D. A. Morozov
%A I. A. Smal
%A T. A. Garipov
%A A. V. Glazkova
%T Keywords, morpheme parsing and syntactic trees: features for text complexity assessment
%J Modelirovanie i analiz informacionnyh sistem
%D 2024
%P 206-220
%V 31
%N 2
%I mathdoc
%U http://geodesic.mathdoc.fr/item/MAIS_2024_31_2_a6/
%G ru
%F MAIS_2024_31_2_a6

D. A. Morozov; I. A. Smal; T. A. Garipov; A. V. Glazkova. Keywords, morpheme parsing and syntactic trees: features for text complexity assessment. Modelirovanie i analiz informacionnyh sistem, Volume 31 (2024) no. 2, pp. 206-220. http://geodesic.mathdoc.fr/item/MAIS_2024_31_2_a6/
