Keyphrase generation for the Russian-language scientific texts using mT5

A. V. Glazkova; D. A. Morozov; M. S. Vorobeva; A. A. Stupnikov

Modelirovanie i analiz informacionnyh sistem, Volume 30 (2023) no. 4, pp. 418-428

See the article record from the source Math-Net.Ru

Abstract

In this work, we applied the multilingual text-to-text transformer (mT5) to the task of keyphrase generation for Russian scientific texts using the Keyphrases CS Russian corpus. Automatic keyphrase selection is a relevant natural language processing task, since keyphrases help readers find articles easily and facilitate the systematization of scientific texts. In this paper, keyphrase selection is treated as a text summarization task. The mT5 model was fine-tuned on the abstracts of Russian research papers. We used abstracts as the model input and comma-separated lists of keyphrases as the output. The results of mT5 were compared with several baselines, including TopicRank, YAKE!, RuTermExtract, and KeyBERT. The results are reported in terms of the full-match F1-score, ROUGE-1, and BERTScore. The best results on the test set were obtained by mT5 and RuTermExtract. mT5 demonstrates the highest F1-score (11.24%), exceeding RuTermExtract by 0.22%. RuTermExtract shows the highest ROUGE-1 score (15.12%). According to BERTScore, the best results were also obtained using these two methods: 76.89% for mT5 (BERTScore using mBERT) and 75.8% for RuTermExtract (BERTScore using ruSciBERT). Moreover, we evaluated the capability of mT5 to predict keyphrases that are absent from the source text. The important limitations of the proposed approach are the need for a training sample for fine-tuning and the probably limited suitability of the fine-tuned model in cross-domain settings. The advantages of keyphrase generation using pre-trained mT5 are that there is no need to define the number and length of keyphrases or to normalize the produced keyphrases, which is important for inflected languages, and that the model can generate keyphrases that are not explicitly present in the text.

Keywords: automatic text summarization, selecting keyphrases, mT5.
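
The abstract describes a model output that is a comma-separated list of keyphrases, scored with a full-match F1 and inspected for keyphrases absent from the source text. The sketch below illustrates one way these steps could look. It is not the authors' code: the function names are hypothetical, and the normalization (lowercasing, whitespace collapsing) and the substring test for "present" keyphrases are assumptions. For an inflected language such as Russian, lemmatization would probably be needed before matching.

```python
from typing import List, Tuple


def parse_keyphrases(generated: str) -> List[str]:
    """Split a comma-separated string into unique, lowercased keyphrases.

    Assumption: lowercasing and whitespace collapsing are the only
    normalization steps; the paper may normalize differently.
    """
    seen, result = set(), []
    for part in generated.split(","):
        phrase = " ".join(part.lower().split())  # collapse inner whitespace
        if phrase and phrase not in seen:
            seen.add(phrase)
            result.append(phrase)
    return result


def full_match_f1(predicted: List[str], gold: List[str]) -> float:
    """F1 computed over exact string matches between two keyphrase lists."""
    pred_set, gold_set = set(predicted), set(gold)
    matches = len(pred_set & gold_set)
    if matches == 0:
        return 0.0
    precision = matches / len(pred_set)
    recall = matches / len(gold_set)
    return 2 * precision * recall / (precision + recall)


def split_present_absent(
    keyphrases: List[str], source_text: str
) -> Tuple[List[str], List[str]]:
    """Separate keyphrases found verbatim in the source from those that are not.

    Assumption: a plain substring test on lowercased text stands in for
    whatever matching procedure the paper actually uses.
    """
    text = " ".join(source_text.lower().split())
    present = [k for k in keyphrases if k in text]
    absent = [k for k in keyphrases if k not in text]
    return present, absent


# Illustrative, made-up data (not taken from the Keyphrases CS Russian corpus).
abstract = "We fine-tune a multilingual transformer for keyphrase generation."
predicted = parse_keyphrases(
    "keyphrase generation, multilingual transformer, text summarization"
)
gold = parse_keyphrases("keyphrase generation, text summarization, mT5")

score = full_match_f1(predicted, gold)
present, absent = split_present_absent(predicted, abstract)
print(f"full-match F1: {score:.3f}")
print("present:", present)
print("absent:", absent)
```

Because only exact matches count, a prediction that differs from the gold keyphrase by a single word ending scores zero for that phrase, which may partly explain why full-match F1 values tend to be much lower than ROUGE-1 or BERTScore values on the same outputs.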

@article{MAIS_2023_30_4_a7,
     author = {A. V. Glazkova and D. A. Morozov and M. S. Vorobeva and A. A. Stupnikov},
     title = {Keyphrase generation for the {Russian-language} scientific texts using {mT5}},
     journal = {Modelirovanie i analiz informacionnyh sistem},
     pages = {418--428},
     publisher = {mathdoc},
     volume = {30},
     number = {4},
     year = {2023},
     language = {ru},
     url = {http://geodesic.mathdoc.fr/item/MAIS_2023_30_4_a7/}
}

TY  - JOUR
AU  - A. V. Glazkova
AU  - D. A. Morozov
AU  - M. S. Vorobeva
AU  - A. A. Stupnikov
TI  - Keyphrase generation for the Russian-language scientific texts using mT5
JO  - Modelirovanie i analiz informacionnyh sistem
PY  - 2023
SP  - 418
EP  - 428
VL  - 30
IS  - 4
PB  - mathdoc
UR  - http://geodesic.mathdoc.fr/item/MAIS_2023_30_4_a7/
LA  - ru
ID  - MAIS_2023_30_4_a7
ER  -

%0 Journal Article
%A A. V. Glazkova
%A D. A. Morozov
%A M. S. Vorobeva
%A A. A. Stupnikov
%T Keyphrase generation for the Russian-language scientific texts using mT5
%J Modelirovanie i analiz informacionnyh sistem
%D 2023
%P 418-428
%V 30
%N 4
%I mathdoc
%U http://geodesic.mathdoc.fr/item/MAIS_2023_30_4_a7/
%G ru
%F MAIS_2023_30_4_a7

A. V. Glazkova; D. A. Morozov; M. S. Vorobeva; A. A. Stupnikov. Keyphrase generation for the Russian-language scientific texts using mT5. Modelirovanie i analiz informacionnyh sistem, Volume 30 (2023) no. 4, pp. 418-428. http://geodesic.mathdoc.fr/item/MAIS_2023_30_4_a7/