Robust word vectors: context-informed embeddings for noisy texts
Zapiski Nauchnykh Seminarov POMI, Investigations on applied mathematics and informatics. Part I, Volume 499 (2021), pp. 248-266. This article was harvested from the Math-Net.Ru source.

We propose a new language-independent architecture for robust word vectors (RoVe). It is designed to alleviate the problem of typos and misspellings, which are common in almost any user-generated content and hinder automatic text processing. Our model is morphologically motivated, which allows it to handle unseen word forms in morphologically rich languages. We present results on a number of natural language processing (NLP) tasks and languages for a variety of related architectures and show that the proposed architecture is robust to typos.
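
The abstract describes the idea only at a high level. As a minimal sketch of how a typo-robust, morphology-friendly word encoding can be organized, the Python snippet below splits a word into Begin/Middle/End bags of characters, so that character transpositions inside a word barely change the vector. The alphabet, split sizes, and function names here are illustrative assumptions, not the authors' exact implementation; in the full model such character-level encodings would additionally be passed through a context encoder (e.g., a bidirectional recurrent network) so that the final embedding of a noisy token is also informed by its neighbors.

import numpy as np

# Illustrative alphabet and dimensionality; a real model would cover the
# full character inventory of the target language.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"
CHAR_DIM = len(ALPHABET)

def char_bag(chars):
    """Order-insensitive bag-of-characters vector: permuting letters
    inside the bag leaves the encoding unchanged."""
    v = np.zeros(CHAR_DIM)
    for c in chars:
        if c in ALPHABET:
            v[ALPHABET.index(c)] += 1.0
    return v

def bme_encode(word):
    """Encode a word as Begin / Middle / End character bags.
    The first and last characters are kept separate because they are the
    most typo-stable positions; the middle is an unordered bag, so
    transpositions there do not change the vector at all."""
    if len(word) < 3:
        return np.concatenate([char_bag(word),
                               np.zeros(CHAR_DIM), np.zeros(CHAR_DIM)])
    return np.concatenate([char_bag(word[0]),
                           char_bag(word[1:-1]),
                           char_bag(word[-1])])

# A clean word and a misspelled variant receive identical encodings when
# the typo is a transposition inside the middle of the word:
clean = bme_encode("embedding")
noisy = bme_encode("embedidng")  # characters transposed in the middle
print(np.linalg.norm(clean - noisy))  # prints 0.0

The design choice illustrated here is the trade-off the paper's title points at: discarding character order inside the word buys robustness to typos, while the surrounding context (not shown in this sketch) restores the disambiguating signal that the order would otherwise carry.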
@article{ZNSL_2021_499_a13,
     author = {T. Khakhulin and V. Logacheva and V. Malykh},
     title = {Robust word vectors: context-informed embeddings for noisy texts},
     journal = {Zapiski Nauchnykh Seminarov POMI},
     pages = {248--266},
     year = {2021},
     volume = {499},
     language = {en},
     url = {http://geodesic.mathdoc.fr/item/ZNSL_2021_499_a13/}
}
TY  - JOUR
AU  - T. Khakhulin
AU  - V. Logacheva
AU  - V. Malykh
TI  - Robust word vectors: context-informed embeddings for noisy texts
JO  - Zapiski Nauchnykh Seminarov POMI
PY  - 2021
SP  - 248
EP  - 266
VL  - 499
UR  - http://geodesic.mathdoc.fr/item/ZNSL_2021_499_a13/
LA  - en
ID  - ZNSL_2021_499_a13
ER  - 
%0 Journal Article
%A T. Khakhulin
%A V. Logacheva
%A V. Malykh
%T Robust word vectors: context-informed embeddings for noisy texts
%J Zapiski Nauchnykh Seminarov POMI
%D 2021
%P 248-266
%V 499
%U http://geodesic.mathdoc.fr/item/ZNSL_2021_499_a13/
%G en
%F ZNSL_2021_499_a13
T. Khakhulin; V. Logacheva; V. Malykh. Robust word vectors: context-informed embeddings for noisy texts. Zapiski Nauchnykh Seminarov POMI, Investigations on applied mathematics and informatics. Part I, Volume 499 (2021), pp. 248-266. http://geodesic.mathdoc.fr/item/ZNSL_2021_499_a13/
