Text classification by CEFR levels using machine learning methods and BERT language model
Modelirovanie i analiz informacionnyh sistem, Volume 30 (2023) no. 3, pp. 202-213

See the article record from the source Math-Net.Ru

This paper studies the automatic classification of short coherent English texts (essays) by level on the international CEFR scale. Determining the level of a natural-language text is an important part of assessing students' knowledge, including the checking of open-ended tasks in e-learning systems. To solve this problem, vector text models were considered that are based on stylometric numerical features at the character, word, and sentence-structure levels. The resulting vectors were classified with standard machine learning classifiers. The article presents the results of the three most successful ones: Support Vector Classifier, Stochastic Gradient Descent Classifier, and Logistic Regression. Precision, recall, and F-score served as quality measures. Two open text corpora, CEFR Levelled English Texts and BEA-2019, were chosen for the experiments. The best classification results for six CEFR levels and sublevels from A1 to C2 were shown by the Support Vector Classifier, with an F-score of 67% on CEFR Levelled English Texts. This approach was compared with the BERT language model (six different variants). The best model, bert-base-cased, achieved an F-score of 69%. Analysis of the classification errors showed that most of them occur between neighboring levels, which is understandable from a domain point of view. In addition, classification quality depended strongly on the text corpus: the same text models produced significantly different F-scores on different corpora. Overall, the results show that automatic text level detection is effective and can be applied in practice.
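To make the pipeline described in the abstract more concrete, the following is a minimal Python sketch, not the authors' implementation. The four features in `stylometric_features` are illustrative stand-ins for character-, word-, and sentence-level stylometric features; the abstract does not list the actual feature set. The macro averaging in `macro_scores` is likewise an assumption, since the abstract does not say how precision, recall, and F-score were averaged. The helper `adjacent_error_share` is a hypothetical way to quantify the observation that most errors fall between neighboring levels.

```python
import re

# CEFR levels in ascending order; the order is used to measure error distance.
LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]


def stylometric_features(text):
    """Return a small illustrative feature vector (not the paper's feature set)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    n_chars = max(len(text), 1)
    n_words = max(len(words), 1)
    n_sentences = max(len(sentences), 1)
    return [
        sum(c in ",;:" for c in text) / n_chars,      # character level: punctuation rate
        sum(len(w) for w in words) / n_words,         # word level: mean word length
        len({w.lower() for w in words}) / n_words,    # word level: type-token ratio
        len(words) / n_sentences,                     # sentence level: mean sentence length
    ]


def macro_scores(y_true, y_pred, labels):
    """Macro-averaged precision, recall, and F-score (averaging scheme is an assumption)."""
    precisions, recalls, f_scores = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        precisions.append(precision)
        recalls.append(recall)
        f_scores.append(2 * precision * recall / total if total else 0.0)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f_scores) / n


def adjacent_error_share(y_true, y_pred, levels=LEVELS):
    """Fraction of misclassifications that land on a neighboring CEFR level."""
    rank = {level: i for i, level in enumerate(levels)}
    errors = [(t, p) for t, p in zip(y_true, y_pred) if t != p]
    if not errors:
        return 0.0
    adjacent = sum(abs(rank[t] - rank[p]) == 1 for t, p in errors)
    return adjacent / len(errors)
```

In a fuller experiment, the vectors returned by `stylometric_features` would be passed to a classifier such as a support vector machine, and the predictions scored with `macro_scores`. A high value of `adjacent_error_share` would correspond to the error pattern reported in the abstract.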
Keywords: natural language processing, CEFR, text classification, BERT.
@article{MAIS_2023_30_3_a1,
     author = {N. S. Lagutina and K. V. Lagutina and A. M. Brederman and N. N. Kasatkina},
     title = {Text classification by {CEFR} levels using machine learning methods and {BERT} language model},
     journal = {Modelirovanie i analiz informacionnyh sistem},
     pages = {202--213},
     publisher = {mathdoc},
     volume = {30},
     number = {3},
     year = {2023},
     language = {ru},
     url = {http://geodesic.mathdoc.fr/item/MAIS_2023_30_3_a1/}
}
TY  - JOUR
AU  - N. S. Lagutina
AU  - K. V. Lagutina
AU  - A. M. Brederman
AU  - N. N. Kasatkina
TI  - Text classification by CEFR levels using machine learning methods and BERT language model
JO  - Modelirovanie i analiz informacionnyh sistem
PY  - 2023
SP  - 202
EP  - 213
VL  - 30
IS  - 3
PB  - mathdoc
UR  - http://geodesic.mathdoc.fr/item/MAIS_2023_30_3_a1/
LA  - ru
ID  - MAIS_2023_30_3_a1
ER  - 
%0 Journal Article
%A N. S. Lagutina
%A K. V. Lagutina
%A A. M. Brederman
%A N. N. Kasatkina
%T Text classification by CEFR levels using machine learning methods and BERT language model
%J Modelirovanie i analiz informacionnyh sistem
%D 2023
%P 202-213
%V 30
%N 3
%I mathdoc
%U http://geodesic.mathdoc.fr/item/MAIS_2023_30_3_a1/
%G ru
%F MAIS_2023_30_3_a1
N. S. Lagutina; K. V. Lagutina; A. M. Brederman; N. N. Kasatkina. Text classification by CEFR levels using machine learning methods and BERT language model. Modelirovanie i analiz informacionnyh sistem, Volume 30 (2023) no. 3, pp. 202-213. http://geodesic.mathdoc.fr/item/MAIS_2023_30_3_a1/