Recovering word forms by context for morphologically rich languages
Zapiski Nauchnykh Seminarov POMI, Investigations on applied mathematics and informatics. Part I, Volume 499 (2021), pp. 129-136. This article was harvested from the Math-Net.Ru source.


In this work, we focus on “sentence-level unlemmatization”, the task of generating a grammatical sentence from a lemmatized one, which humans can usually do easily. We treat this setting as a machine translation problem and, as a first attempt, apply a sequence-to-sequence model to the texts of Russian Wikipedia articles, quantitatively evaluate the effect of different training set sizes, and achieve a BLEU score of 67.3 using the largest available training set. We discuss preliminary results and the shortcomings of traditional machine translation evaluation methods for this task, and suggest directions for future research.
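The abstract describes the setup only briefly; the following minimal Python sketch illustrates the task under stated assumptions: source/target pairs for a sequence-to-sequence model are obtained by lemmatizing original Russian sentences with pymystem3 (reference [14]), and generated sentences are scored with corpus-level BLEU via NLTK. The paper's exact preprocessing and evaluation configuration is not given here, so the tools and example sentence below are illustrative.

from pymystem3 import Mystem
from nltk.translate.bleu_score import corpus_bleu

mystem = Mystem()

def lemmatize_sentence(sentence):
    # Replace each word with its lemma while preserving word order;
    # Mystem.lemmatize() returns lemmas interleaved with the original whitespace.
    return "".join(mystem.lemmatize(sentence)).strip()

# A training pair for the seq2seq model: source = lemmatized sentence, target = original.
original = "Кошки сидели на крыше"
source = lemmatize_sentence(original)   # roughly "кошка сидеть на крыша"
pair = (source, original)

# Evaluation: compare the model's generated sentences against the originals with corpus BLEU.
references = [[original.split()]]       # one list of reference token sequences per sentence
hypotheses = [original.split()]         # placeholder for the model's output tokens
print(corpus_bleu(references, hypotheses))

In the actual experiments such pairs would be fed to an encoder-decoder system such as OpenNMT [5] or a Transformer [15]; the sketch above only covers the data construction and metric side.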
@article{ZNSL_2021_499_a8,
     author = {A. M. Alekseev and S. I. Nikolenko},
     title = {Recovering word forms by context for~morphologically~rich~languages},
     journal = {Zapiski Nauchnykh Seminarov POMI},
     pages = {129--136},
     year = {2021},
     volume = {499},
     language = {en},
     url = {http://geodesic.mathdoc.fr/item/ZNSL_2021_499_a8/}
}
TY  - JOUR
AU  - A. M. Alekseev
AU  - S. I. Nikolenko
TI  - Recovering word forms by context for morphologically rich languages
JO  - Zapiski Nauchnykh Seminarov POMI
PY  - 2021
SP  - 129
EP  - 136
VL  - 499
UR  - http://geodesic.mathdoc.fr/item/ZNSL_2021_499_a8/
LA  - en
ID  - ZNSL_2021_499_a8
ER  - 
%0 Journal Article
%A A. M. Alekseev
%A S. I. Nikolenko
%T Recovering word forms by context for morphologically rich languages
%J Zapiski Nauchnykh Seminarov POMI
%D 2021
%P 129-136
%V 499
%U http://geodesic.mathdoc.fr/item/ZNSL_2021_499_a8/
%G en
%F ZNSL_2021_499_a8
A. M. Alekseev; S. I. Nikolenko. Recovering word forms by context for morphologically rich languages. Zapiski Nauchnykh Seminarov POMI, Investigations on applied mathematics and informatics. Part I, Tome 499 (2021), pp. 129-136. http://geodesic.mathdoc.fr/item/ZNSL_2021_499_a8/

[1] I. Anisimov, V. Polyakov, E. Makarova, V. Solovyev, “Spelling correction in English: Joint use of bi-grams and chunking”, Intelligent Systems Conference (IntelliSys), IEEE, 2017, 886–892

[2] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, 2014, arXiv: 1409.0473

[3] D. Gavrilov, P. Kalaidin, V. Malykh, Self-attentive model for headline generation, 2019, arXiv: 1901.07786

[4] S. Hochreiter, J. Schmidhuber, “Long short-term memory”, Neural Comput., 9:8 (1997), 1735–1780 | DOI

[5] G. Klein, Y. Kim, Y. Deng, J. Senellart, A. M. Rush, OpenNMT: Open-Source Toolkit for Neural Machine Translation, 2017, arXiv: 1701.02810

[6] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, Chr. Moran, R. Zens, et al., “Moses: Open source toolkit for statistical machine translation”, Proc. of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, Association for Computational Linguistics, 2007, 177–180 | DOI

[7] M. Korobov, “Morphological analyzer and generator for Russian and Ukrainian languages”, Analysis of Images, Social Networks and Texts, Communications in Computer and Information Science, 542, eds. M. Yu. Khachay, N. Konstantinova, A. Panchenko, D. I. Ignatov, V. G. Labunets, Springer International Publishing, 2015, 320–332 (English) | DOI

[8] J. Lee, K. Cho, T. Hofmann, “Fully character-level neural machine translation without explicit segmentation”, Transactions of the Association for Computational Linguistics, 5 (2017), 365–378 | DOI

[9] M.-T. Luong, H. Pham, C. D. Manning, Effective approaches to attention-based neural machine translation, 2015, arXiv: 1508.04025

[10] Z. Miftahutdinov, E. Tutubalina, “Deep learning for ICD coding: Looking for medical concepts in clinical documents in English and in French”, Experimental IR Meets Multilinguality, Multimodality, and Interaction (Cham), eds. P. Bellot, Ch. Trabelsi, J. Mothe, F. Murtagh, J. Y. Nie, L. Soulier, E. SanJuan, L. Cappellato, N. Ferro, Springer International Publishing, 2018, 203–215 | DOI

[11] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, A. Lerer, Automatic differentiation in PyTorch, NIPS-W, 2017

[12] D. Polykovskiy, D. Soloviev, S. Nikolenko, “Concorde: Morphological agreement in conversational models”, Proc. of the 10th Asian Conference on Machine Learning, Proceedings of Machine Learning Research, 95, eds. J. Zhu, I. Takeuchi, PMLR, 2018, 407–421

[13] I. Segalovich, “A fast morphological algorithm with unknown word guessing induced by a dictionary for a web search engine”, MLMTA, Citeseer, 2003, 273–280

[14] D. Sukhonin, A. Panchenko, A Python wrapper of the Yandex Mystem 3.1 morphological analyzer, 2013 https://github.com/nlpub/pymystem3

[15] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, 2017, arXiv: 1706.03762 | Zbl