Classification of articles from mass media by categories and relevance of the subject area
Modelirovanie i analiz informacionnyh sistem, Vol. 29 (2022) no. 3, pp. 266-279.

See the article record from the source Math-Net.Ru

This study addresses the classification of news articles about P. G. Demidov Yaroslavl State University (YarSU) into four categories: “society”, “education”, “science and technologies”, and “not relevant”. The proposed approaches use the BERT neural network and classical machine learning methods (SVM, Logistic Regression, K-Neighbors, Random Forest) in combination with different embedding types: Word2Vec, FastText, TF-IDF, and GPT-3. Text preprocessing approaches are also considered as a means of improving classification quality. In the experiments, the best result was achieved by the SVM classifier with TF-IDF embeddings trained on full article texts with titles: its micro-F-measure and macro-F-measure are 0.8214 and 0.8308, respectively. The BERT neural network trained on fragments of paragraphs mentioning YarSU, from which the first 128 words and the last 384 words were taken, showed comparable results, with a micro-F-measure of 0.8304 and a macro-F-measure of 0.8181. Thus, paragraphs that mention the target organisation are sufficient to classify texts by category efficiently.
Keywords: classification by categories, automatic text processing, subject area, Russian language, news articles.
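Two ideas from the abstract can be illustrated with a minimal, standard-library sketch: keeping the first 128 and last 384 words of a fragment, and computing the micro- and macro-averaged F-measure. The function names `head_tail_words` and `micro_macro_f1` are hypothetical, and several details are assumptions rather than the authors' method: words are split on whitespace, and the macro value is the plain average of per-class F1 scores (other definitions of macro-averaging exist, and the paper may use a different one).

```python
from collections import Counter


def head_tail_words(text, head=128, tail=384):
    """Keep the first `head` and last `tail` whitespace-separated words.

    Assumption: "words" are whitespace-delimited; the paper may count
    words or tokens differently.
    """
    words = text.split()
    if len(words) <= head + tail:
        return " ".join(words)
    return " ".join(words[:head] + words[-tail:])


def micro_macro_f1(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multi-class predictions.

    Assumption: macro F1 is the unweighted mean of per-class F1 scores.
    """
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0, 0.0

    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then average with equal class weight.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro
```

Because micro-averaging pools counts while macro-averaging weights every class equally, the two measures can rank classifiers differently on imbalanced categories, which is consistent with the SVM and BERT results each leading on one of the two measures.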
@article{MAIS_2022_29_3_a7,
     author = {V. D. Larionov and I. V. Paramonov},
     title = {Classification of articles from mass media by categories and relevance of the subject area},
     journal = {Modelirovanie i analiz informacionnyh sistem},
     pages = {266--279},
     publisher = {mathdoc},
     volume = {29},
     number = {3},
     year = {2022},
     language = {ru},
     url = {http://geodesic.mathdoc.fr/item/MAIS_2022_29_3_a7/}
}
V. D. Larionov; I. V. Paramonov. Classification of articles from mass media by categories and relevance of the subject area. Modelirovanie i analiz informacionnyh sistem, Vol. 29 (2022) no. 3, pp. 266-279. http://geodesic.mathdoc.fr/item/MAIS_2022_29_3_a7/
