Word-based Russian text augmentation for character-level models

R. B. Galinsky; A. M. Alekseev; S. I. Nikolenko

Zapiski Nauchnykh Seminarov POMI, Investigations on applied mathematics and informatics. Part I, Volume 499 (2021), pp. 206-221. This article was harvested from the source Math-Net.Ru.

Abstract

Large-scale deep learning models, including models for natural language processing, require large datasets for training that could be unavailable for low-resource languages or for special domains. We consider a way to approach the problem of poor variability and small size of available data for training NLP models based on augmenting the data with synonyms. We design a novel augmentation scheme that includes replacing words with synonyms and reshuffling the words, apply it to the Russian language, and report improved results for the sentiment analysis task.
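The two operations named in the abstract, synonym replacement and word reshuffling, can be illustrated with a minimal sketch. This is not the authors' scheme: the toy dictionary `SYNONYMS`, the function `augment`, and the parameters `replace_prob` and `shuffle` are all hypothetical, and the sketch ignores Russian inflection, which a real pipeline would have to handle.

```python
import random

# Hypothetical toy synonym dictionary (an assumption for illustration only;
# the resources actually used in the paper are not described on this page).
SYNONYMS = {
    "хороший": ["отличный", "прекрасный"],
    "фильм": ["кинофильм"],
}


def augment(text, synonyms, replace_prob=0.5, shuffle=False, rng=None):
    """Return a word-level augmented copy of `text`.

    Each word found in `synonyms` is replaced by a randomly chosen synonym
    with probability `replace_prob`; if `shuffle` is true, the word order
    is then randomly permuted. Inflection is not handled.
    """
    rng = rng or random.Random()
    out = []
    for word in text.split():
        candidates = synonyms.get(word.lower())
        if candidates and rng.random() < replace_prob:
            out.append(rng.choice(candidates))
        else:
            out.append(word)
    if shuffle:
        rng.shuffle(out)
    return " ".join(out)


if __name__ == "__main__":
    rng = random.Random(0)
    # Generate a few augmented variants of one short review.
    for _ in range(3):
        print(augment("хороший фильм", SYNONYMS, shuffle=True, rng=rng))
```

Because the augmentation operates on whole words while the downstream model reads characters, each variant can be fed to a character-level classifier as an additional training example with the original label.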

How to cite

@article{ZNSL_2021_499_a10,
     author = {R. B. Galinsky and A. M. Alekseev and S. I. Nikolenko},
     title = {Word-based {R}ussian text augmentation for character-level models},
     journal = {Zapiski Nauchnykh Seminarov POMI},
     pages = {206--221},
     year = {2021},
     volume = {499},
     language = {en},
     url = {http://geodesic.mathdoc.fr/item/ZNSL_2021_499_a10/}
}

TY  - JOUR
AU  - R. B. Galinsky
AU  - A. M. Alekseev
AU  - S. I. Nikolenko
TI  - Word-based Russian text augmentation for character-level models
JO  - Zapiski Nauchnykh Seminarov POMI
PY  - 2021
SP  - 206
EP  - 221
VL  - 499
UR  - http://geodesic.mathdoc.fr/item/ZNSL_2021_499_a10/
LA  - en
ID  - ZNSL_2021_499_a10
ER  -

%0 Journal Article
%A R. B. Galinsky
%A A. M. Alekseev
%A S. I. Nikolenko
%T Word-based Russian text augmentation for character-level models
%J Zapiski Nauchnykh Seminarov POMI
%D 2021
%P 206-221
%V 499
%U http://geodesic.mathdoc.fr/item/ZNSL_2021_499_a10/
%G en
%F ZNSL_2021_499_a10

R. B. Galinsky; A. M. Alekseev; S. I. Nikolenko. Word-based Russian text augmentation for character-level models. Zapiski Nauchnykh Seminarov POMI, Investigations on applied mathematics and informatics. Part I, Volume 499 (2021), pp. 206-221. http://geodesic.mathdoc.fr/item/ZNSL_2021_499_a10/

