Monolingual and cross-lingual knowledge transfer for topic classification
Zapiski Nauchnykh Seminarov POMI, Investigations on applied mathematics and informatics. Part II–1, Tome 529 (2023), pp. 54-71

See the article record from the source Math-Net.Ru

In this work, we investigate knowledge transfer from the RuQTopics dataset. This Russian topical dataset combines a large number of data points (361,560 single-label, 170,930 multi-label) with extensive class coverage (76 classes). We prepared this dataset from the “Yandex Que” raw data. By evaluating models trained on RuQTopics on the six matching classes of the Russian MASSIVE subset, we show that RuQTopics is suitable for real-world conversational tasks: Russian-only models trained on it consistently reach an accuracy of around 85% on this subset. We also found that, for multilingual BERT trained on RuQTopics and evaluated on the same six classes of MASSIVE (for all MASSIVE languages), the language-wise accuracy correlates closely (Spearman correlation 0.773, p-value 2.997e-11) with the approximate size of the BERT pretraining data for the corresponding language. At the same time, the correlation of language-wise accuracy with linguistic distance from Russian is not statistically significant.
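The rank correlation mentioned in the abstract can be illustrated with a minimal sketch. It computes Spearman's rho as the Pearson correlation of rank-transformed values, using only the Python standard library. The function names `average_ranks` and `spearman` and all numbers in the example are hypothetical and chosen for illustration; they are not the paper's data or code. The sketch does not compute a p-value; `scipy.stats.spearmanr` would return both the coefficient and a p-value.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical values for five languages (NOT taken from the paper):
# approximate pretraining data size and accuracy on the six classes.
pretraining_size = [5.0, 3.2, 1.1, 0.4, 0.1]
accuracy = [0.84, 0.80, 0.71, 0.74, 0.52]

rho = spearman(pretraining_size, accuracy)
print(f"Spearman rho = {rho:.3f}")
```

Because the coefficient depends only on ranks, it is insensitive to the scale of the pretraining-size estimates, which suits a quantity the abstract describes as approximate.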
@article{ZNSL_2023_529_a4,
     author = {D. Karpov and M. Burtsev},
     title = {Monolingual and cross-lingual knowledge transfer for topic classification},
     journal = {Zapiski Nauchnykh Seminarov POMI},
     pages = {54--71},
     publisher = {mathdoc},
     volume = {529},
     year = {2023},
     language = {en},
     url = {http://geodesic.mathdoc.fr/item/ZNSL_2023_529_a4/}
}
TY  - JOUR
AU  - D. Karpov
AU  - M. Burtsev
TI  - Monolingual and cross-lingual knowledge transfer for topic classification
JO  - Zapiski Nauchnykh Seminarov POMI
PY  - 2023
SP  - 54
EP  - 71
VL  - 529
PB  - mathdoc
UR  - http://geodesic.mathdoc.fr/item/ZNSL_2023_529_a4/
LA  - en
ID  - ZNSL_2023_529_a4
ER  - 
%0 Journal Article
%A D. Karpov
%A M. Burtsev
%T Monolingual and cross-lingual knowledge transfer for topic classification
%J Zapiski Nauchnykh Seminarov POMI
%D 2023
%P 54-71
%V 529
%I mathdoc
%U http://geodesic.mathdoc.fr/item/ZNSL_2023_529_a4/
%G en
%F ZNSL_2023_529_a4
D. Karpov; M. Burtsev. Monolingual and cross-lingual knowledge transfer for topic classification. Zapiski Nauchnykh Seminarov POMI, Investigations on applied mathematics and informatics. Part II–1, Tome 529 (2023), pp. 54-71. http://geodesic.mathdoc.fr/item/ZNSL_2023_529_a4/