Deep learning for natural language processing: a survey
Zapiski Nauchnykh Seminarov POMI, Investigations on applied mathematics and informatics. Part I, Vol. 499 (2021), pp. 137-205. This article was harvested from the Math-Net.Ru source.

Over the last decade, deep learning has revolutionized machine learning. Neural network architectures have become the method of choice for many different applications; in this paper, we survey the applications of deep learning to natural language processing (NLP) problems. We begin by briefly reviewing the basic notions and major architectures of deep learning, including some recent advances that are especially important for NLP. Then we survey distributed representations of words, showing both how word embeddings can be extended to sentences and paragraphs and how words can be broken down further in character-level models. Finally, the main part of the survey deals with various deep architectures that have either arisen specifically for NLP tasks or have become a method of choice for them; the tasks include sentiment analysis, dependency parsing, machine translation, dialog and conversational models, question answering, and other applications. Disclaimer: this survey was written in 2016 and reflects the state of the art at the time. Although the field of deep learning moves very quickly, and all directions outlined here have already found many new developments, we hope that this survey can still be useful as an overview of already classical works in the field and a systematic introduction to deep learning for natural language processing.
@article{ZNSL_2021_499_a9,
     author = {E. Arkhangelskaya and S. Nikolenko},
     title = {Deep learning for natural language processing: a survey},
     journal = {Zapiski Nauchnykh Seminarov POMI},
     pages = {137--205},
     year = {2021},
     volume = {499},
     language = {en},
     url = {http://geodesic.mathdoc.fr/item/ZNSL_2021_499_a9/}
}

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, X. Zheng, TensorFlow: Large-scale machine learning on heterogeneous systems, 2015 tensorflow.org

[2] C. Aggarwal, P. Zhao, “Graphical models for text: A new paradigm for text representation and processing”, SIGIR '10, ACM, 2010, 899–900 | DOI

[3] R. Al-Rfou, B. Perozzi, S. Skiena, “Polyglot: Distributed word representations for multilingual nlp”, Proc. 17th Conference on Computational Natural Language Learning (Sofia, Bulgaria), ACL, 2013, 183–192

[4] G. Angeli, C. D. Manning, “Naturalli: Natural logic inference for common sense reasoning”, Proc. EMNLP (Doha, Qatar, October 2014), ACL, 2014, 534–545

[5] E. Arisoy, T. N. Sainath, B. Kingsbury, B. Ramabhadran, “Deep neural network language models”, Proc. NAACL-HLT 2012 Workshop: Will We Ever Really Replace the N-gram Model? On the Future of Language Modeling for HLT, ACL, 2012, 20–28

[6] J. Ba, V. Mnih, K. Kavukcuoglu, “Multiple object recognition with visual attention”, ICLR'15, 2015

[7] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, 2014, arXiv: 1409.0473

[8] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, Y. Bengio, End-to-end attention-based large vocabulary speech recognition, 2015, arXiv: 1508.04395

[9] M. Ballesteros, C. Dyer, N. A. Smith, “Improved transition-based parsing by modeling characters instead of words with lstms”, Proc. EMNLP 2015 (Lisbon, Portugal), ACL, 2015, 349–359

[10] P. Baltescu, P. Blunsom, “Pragmatic neural language modelling in machine translation”, NAACL HLT 2015, 2015, 820–829

[11] L. Banarescu, C. Bonial, S. Cai, M. Georgescu, K. Griffitt, U. Hermjakob, K. Knight, P. Koehn, M. Palmer, N. Schneider, “Abstract meaning representation for sembanking”, Proc. 7th Linguistic Annotation Workshop and Interoperability with Discourse (Sofia, Bulgaria, August 2013), ACL, 178–186

[12] R. E. Banchs, “Movie-dic: A movie dialogue corpus for research and development”, ACL '12, ACL, 2012, 203–207

[13] M. Baroni, R. Zamparelli, “Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space”, EMNLP '10, ACL, 2010, 1183–1193

[14] S. Bartunov, D. Kondrashkin, A. Osokin, D. P. Vetrov, “Breaking sticks and ambiguities with adaptive skip-gram”, Proc. 19th International Conference on Artificial Intelligence and Statistics, AISTATS 2016 (Cadiz, Spain, May 9–11, 2016), 2016, 130–138

[15] F. Bastien, P. Lamblin, R. Pascanu, J. Bergstra, I. J. Goodfellow, A. Bergeron, N. Bouchard, Y. Bengio, “Theano: New features and speed improvements”, Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, 2012

[16] Y. Bengio, R. Ducharme, P. Vincent, “A neural probabilistic language model”, Journal of Machine Learning Research, 3 (2003), 1137–1155 | Zbl

[17] Y. Bengio, “Learning deep architectures for ai”, Foundations and Trends in Machine Learning, 2:1 (2009), 1–127 | DOI | MR | Zbl

[18] Y. Bengio, “Practical recommendations for gradient-based training of deep architectures”, Neural Networks: Tricks of the Trade, Second Edition, 2012, 437–478 | DOI

[19] Y. Bengio, A. Courville, P. Vincent, “Representation learning: A review and new perspectives”, IEEE Transactions on Pattern Analysis and Machine Intelligence, 35:8 (2013), 1798–1828 | DOI

[20] Y. Bengio, P. Lamblin, D. Popovici, H. Larochelle, “Greedy layer-wise training of deep networks”, NIPS'06, MIT Press, 2006, 153–160

[21] Y. Bengio, H. Schwenk, J.-S. Senécal, F. Morin, J.-L. Gauvain, “Neural probabilistic language models”, Innovations in Machine Learning, Springer, 2006, 137–186

[22] Y. Bengio, L. Yao, G. Alain, P. Vincent, Generalized denoising auto-encoders as generative models, 2013, arXiv: 1305.6663 | Zbl

[23] J. Berant, A. Chou, R. Frostig, P. Liang, “Semantic parsing on Freebase from question-answer pairs”, Proc. 2013 EMNLP (Seattle, Washington, USA, October 2013), ACL, 1533–1544

[24] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, Y. Bengio, “Theano: a CPU and GPU math expression compiler”, Proc. Python for Scientific Computing Conference (SciPy), 2010 (Oral Presentation)

[25] D. P. Bertsekas, Convex analysis and optimization, Athena Scientific, 2003 | Zbl

[26] J. Bian, B. Gao, T.-Y. Liu, “Knowledge-powered deep learning for word embedding”, Machine Learning and Knowledge Discovery in Databases, Springer, 2014, 132–148 | DOI

[27] C. M. Bishop, Pattern recognition and machine learning, Springer, 2006 | Zbl

[28] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, J. Taylor, “Freebase: A collaboratively created graph database for structuring human knowledge”, SIGMOD '08, ACM, 2008, 1247–1250 | DOI

[29] D. Bollegala, T. Maehara, K.-i. Kawarabayashi, “Unsupervised cross-domain word representation learning”, Proc. 53rd ACL and the 7th IJCNLP (Beijing, China), v. 1, Long Papers, ACL, 2015, 730–740

[30] F. Bond, K. Paik, “A Survey of WordNets and their Licenses”, GWC 2012, 2012, 64–71

[31] A. Bordes, X. Glorot, J. Weston, Y. Bengio, “Joint learning of words and meaning representations for open-text semantic parsing”, JMLR 2012

[32] A. Bordes, X. Glorot, J. Weston, Y. Bengio, “A semantic matching energy function for learning with multi-relational data”, Machine Learning, 94:2 (2013), 233–259 | DOI

[33] A. Bordes, N. Usunier, S. Chopra, J. Weston, Large-scale simple question answering with memory networks, 2015, arXiv: 1506.02075 | Zbl

[34] A. Borisov, I. Markov, M. de Rijke, P. Serdyukov, “A neural click model for web search”, WWW '16, ACM, 2016 (to appear)

[35] E. Boros, R. Besançon, O. Ferret, B. Grau, “Event role extraction using domain-relevant word representations”, Proc. EMNLP (Doha, Qatar, October 2014), ACL, 1852–1857

[36] J. A. Botha, P. Blunsom, “Compositional morphology for word representations and language modelling”, Proc. 31th ICML, 2014, 1899–1907

[37] H. Bourlard, Y. Kamp, Auto-association by multilayer perceptrons and singular value decomposition, Manuscript M217, Philips Research Laboratory, Brussels, Belgium, 1987

[38] O. Bousquet, U. Luxburg, G. Ratsch (eds.), Advanced lectures on machine learning, Springer, 2004 | Zbl

[39] S. R. Bowman, C. Potts, C. D. Manning, Learning distributed word representations for natural logic reasoning, 2014, arXiv: 1410.4176

[40] S. R. Bowman, C. Potts, C. D. Manning, Recursive neural networks for learning logical semantics, 2014, arXiv: 1406.1827

[41] A. Bride, T. Van de Cruys, N. Asher, “A generalisation of lexical functions for composition in distributional semantics”, Proc. 53rd ACL and the 7th IJCNLP (Beijing, China), v. 1, Long Papers, ACL, 2015, 281–291

[42] P. F. Brown, P. V. deSouza, R. L. Mercer, V. J. D. Pietra, J. C. Lai, “Class-based n-gram models of natural language”, Comput. Linguist., 18:4 (1992), 467–479

[43] P. F. Brown, V. J. D. Pietra, S. A. D. Pietra, R. L. Mercer, “The mathematics of statistical machine translation: Parameter estimation”, Comput. Linguist., 19:2 (1993), 263–311

[44] J. Buys, P. Blunsom, “Generative incremental dependency parsing with neural networks”, Proc. 53rd ACL and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, v. 2, Short Papers, 2015, 863–869

[45] E. Cambria, “Affective computing and sentiment analysis”, IEEE Intelligent Systems, 31:2 (2016) | DOI

[46] Z. Cao, S. Li, Y. Liu, W. Li, H. Ji, “A novel neural topic model and its supervised extension”, Proc. 29th AAAI Conference on Artificial Intelligence (January 25–30, 2015, Austin, Texas, USA), 2015, 2210–2216

[47] X. Carreras, L. Marquez, “Introduction to the conll-2005 shared task: Semantic role labeling”, CONLL '05, ACL, 2005, 152–164 | DOI

[48] B. Chen, H. Guo, “Representation based translation evaluation metrics”, Proc. 53rd ACL and the 7th IJCNLP (Beijing, China, July 2015), v. 2, Short Papers, ACL, 150–155

[49] D. Chen, R. Socher, C. D. Manning, A. Y. Ng, “Learning new facts from knowledge bases with neural tensor networks and semantic word vectors”, International Conference on Learning Representations (ICLR), 2013

[50] M. Chen, Z. E. Xu, K. Q. Weinberger, F. Sha, “Marginalized denoising autoencoders for domain adaptation”, Proc. 29th ICML, 2012 icml.cc/Omnipress

[51] S. F. Chen, J. Goodman, “An empirical study of smoothing techniques for language modeling”, ACL '96, ACL, 1996, 310–318

[52] X. Chen, Y. Zhou, C. Zhu, X. Qiu, X. Huang, “Transition-based dependency parsing using two heterogeneous gated recursive neural networks”, Proc. EMNLP 2015 (Lisbon, Portugal), ACL, 2015, 1879–1889

[53] Y. Chen, L. Xu, K. Liu, D. Zeng, J. Zhao, “Event extraction via dynamic multi-pooling convolutional neural networks”, Proc. 53rd ACL and the 7th IJCNLP (Beijing, China), v. 1, Long Papers, ACL, 2015, 167–176

[54] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, E. Shelhamer, cudnn: Efficient primitives for deep learning, 2014, arXiv: 1410.0759

[55] K. Cho, Introduction to neural machine translation with gpus, 2015

[56] K. Cho, B. van Merrienboer, D. Bahdanau, Y. Bengio, On the properties of neural machine translation: Encoder-decoder approaches, 2014, arXiv: 1409.1259

[57] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, Y. Bengio, “Learning phrase representations using rnn encoder–decoder for statistical machine translation”, Proc. 2014 EMNLP (Doha, Qatar), ACL, 2014, 1724–1734

[58] K. Cho, B. van Merrienboer, Ç. Gulçehre, D. Bahdanau, F. Bougares, H. Schwenk, Y. Bengio, “Learning phrase representations using RNN encoder-decoder for statistical machine translation”, Proc. EMNLP 2014, 2014, 1724–1734

[59] F. Chollet, Keras, 2015 https://github.com/fchollet/keras | Zbl

[60] J. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, Y. Bengio, Attention-based models for speech recognition, 2015, arXiv: 1506.07503

[61] J. Chung, K. Cho, Y. Bengio, A character-level decoder without explicit segmentation for neural machine translation, 2016, arXiv: 1603.06147

[62] J. Chung, Ç. Gulçehre, K. Cho, Y. Bengio, Empirical evaluation of gated recurrent neural networks on sequence modeling, 2014, arXiv: 1412.3555

[63] S. Clark, B. Coecke, M. Sadrzadeh, “A compositional distributional model of meaning”, Proc. Second Symposium on Quantum Interaction, QI-2008, 2008, 133–140

[64] S. Clark, B. Coecke, M. Sadrzadeh, “Mathematical foundations for a compositional distributed model of meaning”, Linguistic Analysis, 36:1–4 (2011), 345–384

[65] B. Coecke, M. Sadrzadeh, S. Clark, Mathematical foundations for a compositional distributional model of meaning, 2010, arXiv: 1003.4394

[66] R. Collobert, S. Bengio, J. Marithoz, Torch: A modular machine learning software library, 2002

[67] R. Collobert, J. Weston, “A unified architecture for natural language processing: Deep neural networks with multitask learning”, Proc. 25th international conference on Machine learning, ACM, 2008, 160–167 | DOI

[68] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, P. Kuksa, “Natural language processing (almost) from scratch”, Journal of Machine Learning Research, 12 (2011), 2493–2537 | Zbl

[69] T. Cooijmans, N. Ballas, C. Laurent, A. Courville, Recurrent batch normalization, 2016, arXiv: 1603.09025

[70] L. Deng, Y. Liu (eds.), Deep learning in natural language processing, Springer, 2018

[71] L. Deng, D. Yu, “Deep learning: Methods and applications”, Foundations and Trends in Signal Processing, 7:3–4 (2014), 197–387 | DOI

[72] L. Deng, D. Yu, “Deep learning: Methods and applications”, Foundations and Trends in Signal Process, 7:3–4 (2014), 197–387 | DOI

[73] J. Devlin, R. Zbib, Z. Huang, T. Lamar, R. Schwartz, J. Makhoul, “Fast and robust neural network joint models for statistical machine translation”, Proc. 52nd ACL (Baltimore, Maryland, June 2014), v. 1, Long Papers, ACL, 1370–1380

[74] N. Djuric, H. Wu, V. Radosavljevic, M. Grbovic, N. Bhamidipati, “Hierarchical neural language models for joint representation of streaming documents and their content”, WWW '15, ACM, 2015, 248–255 | DOI

[75] B. Dolan, C. Quirk, C. Brockett, “Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources”, COLING '04, ACL, 2004

[76] L. Dong, F. Wei, M. Zhou, K. Xu, “Question answering over freebase with multi-column convolutional neural networks”, Proc. 53rd ACL and the 7th IJCNLP (Beijing, China), v. 1, Long Papers, ACL, 2015, 260–269

[77] S. A. Duffy, J. M. Henderson, R. K. Morris, “Semantic facilitation of lexical access during sentence processing”, Journal of Experimental Psychology: Learning, Memory, and Cognition, 15 (1989), 791–801 | DOI

[78] G. Durrett, D. Klein, Neural CRF parsing, 2015, arXiv: 1507.03641

[79] C. Dyer, M. Ballesteros, W. Ling, A. Matthews, N. A. Smith, “Transition-based dependency parsing with stack long short-term memory”, Proc. 53rd ACL and the 7th IJCNLP (Beijing, China), v. 1, Long Papers, ACL, 2015, 334–343

[80] J. L. Elman, “Finding structure in time”, Cognitive Science, 14:2 (1990), 179–211 | DOI

[81] K. Erk, “Representing words as regions in vector space”, CoNLL '09, ACL, 2009, 57–65 | DOI

[82] A. Fader, L. Zettlemoyer, O. Etzioni, “Paraphrase-driven learning for open question answering”, Proc. 51st ACL (Sofia, Bulgaria, August 2013), v. 1, Long Papers, ACL, 1608–1618

[83] C. Fellbaum (ed.), WordNet: An Electronic Lexical Database, MIT Press, Cambridge, MA, 1998 | Zbl

[84] C. Fellbaum, “Wordnet and wordnets”, Encyclopedia of Language and Linguistics (Oxford), ed. K. Brown, Elsevier, 2005, 665–670

[85] D. A. Ferrucci, E. W. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. Kalyanpur, A. Lally, J. W. Murdock, E. Nyberg, J. M. Prager, N. Schlaefer, C. A. Welty, “Building Watson: An overview of the DeepQA project”, AI Magazine, 31:3 (2010), 59–79 | DOI

[86] O. Firat, K. Cho, Y. Bengio, Multi-way, multilingual neural machine translation with a shared attention mechanism, 2016, arXiv: 1601.01073

[87] D. Fried, T. Polajnar, S. Clark, “Low-rank tensors for verbs in compositional distributional semantics”, Proc. 53rd ACL and the 7th IJCNLP (Beijing, China), v. 2, Short Papers, ACL, 2015, 731–736

[88] K. Fukushima, “Neural network model for a mechanism of pattern recognition unaffected by shift in position — Neocognitron”, Transactions of the IECE, J62-A:10 (1979), 658–665

[89] K. Fukushima, “Neocognitron: A self-organizing neural network for a mechanism of pattern recognition unaffected by shift in position”, Biological Cybernetics, 36:4 (1980), 193–202 | DOI | Zbl

[90] Y. Gal, A theoretically grounded application of dropout in recurrent neural networks, 2015, arXiv: 1512.05287

[91] Y. Gal, Z. Ghahramani, “Dropout as a Bayesian approximation: Insights and applications”, Deep Learning Workshop, ICML 2015

[92] J. Gao, X. He, W.-t. Yih, L. Deng, “Learning continuous phrase representations for translation modeling”, Proc. ACL 2014, ACL, 2014

[93] J. Gao, P. Pantel, M. Gamon, X. He, L. Deng, Y. Shen, “Modeling interestingness with deep neural networks”, EMNLP 2014

[94] F. A. Gers, J. Schmidhuber, F. Cummins, “Learning to forget: Continual prediction with LSTM”, Neural Computation, 12:10 (2000), 2451–2471 | DOI

[95] F. A. Gers, J. Schmidhuber, “Recurrent nets that time and count”, Proc. IEEE-INNS-ENNS International Joint Conference on Neural Networks, IJCNN 2000, v. 3, IEEE, 2000, 189–194 | DOI

[96] L. Getoor, B. Taskar, Introduction to statistical relational learning (adaptive computation and machine learning), MIT Press, 2007

[97] F. Girosi, M. Jones, T. Poggio, “Regularization theory and neural networks architectures”, Neural Computation, 7:2 (1995), 219–269 | DOI

[98] X. Glorot, Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks”, International conference on artificial intelligence and statistics (2010), 249–256

[99] X. Glorot, A. Bordes, Y. Bengio, “Deep sparse rectifier networks”, AISTATS, 15 (2011), 315–323

[100] X. Glorot, A. Bordes, Y. Bengio, “Domain adaptation for large-scale sentiment classification: A deep learning approach”, Proc. 28th ICML, 2011, 513–520

[101] Y. Goldberg, A primer on neural network models for natural language processing, 2015, arXiv: 1510.00726

[102] I. Goodfellow, Y. Bengio, A. Courville, Deep learning, MIT Press, 2016 http://www.deeplearningbook.org | Zbl

[103] J. T. Goodman, “A bit of progress in language modeling”, Comput. Speech Lang., 15:4 (2001), 403–434 | DOI

[104] A. Graves, Generating sequences with recurrent neural networks, 2013, arXiv: 1308.0850

[105] A. Graves, S. Fernandez, J. Schmidhuber, “Bidirectional LSTM networks for improved phoneme classification and recognition”, Artificial Neural Networks: Formal Models and Their Applications, ICANN 2005, 15th International Conference, Proceedings (Warsaw, Poland, September 11–15, 2005), v. II, 2005, 799–804

[106] A. Graves, J. Schmidhuber, “Framewise phoneme classification with bidirectional LSTM and other neural network architectures”, Neural Networks, 18:5–6 (2005), 602–610 | DOI

[107] E. Grefenstette, Towards a formal distributional semantics: Simulating logical calculi with tensors, 2013, arXiv: 1304.5823

[108] E. Grefenstette, M. Sadrzadeh, “Experimental support for a categorical compositional distributional model of meaning”, EMNLP '11, ACL, 2011, 1394–1404

[109] E. Grefenstette, M. Sadrzadeh, S. Clark, B. Coecke, S. Pulman, “Concrete sentence spaces for compositional distributional models of meaning”, Proc. 9th International Conference on Computational Semantics, IWCS11, 2011, 125–134

[110] E. Grefenstette, M. Sadrzadeh, S. Clark, B. Coecke, S. Pulman, “Concrete sentence spaces for compositional distributional models of meaning”, Computing Meaning, Springer, 2014, 71–86 | DOI

[111] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, J. Schmidhuber, LSTM: A search space odyssey, 2015, arXiv: 1503.04069

[112] J. Gu, Z. Lu, H. Li, V. O. K. Li, Incorporating copying mechanism in sequence-to-sequence learning, 2016, arXiv: 1603.06393

[113] H. Guo, Generating text with deep reinforcement learning, 2015, arXiv: 1510.09202

[114] S. Guo, Q. Wang, B. Wang, L. Wang, L. Guo, “Semantically smooth knowledge graph embedding”, Proc. 53rd ACL and the 7th IJCNLP (Beijing, China), v. 1, Long Papers, ACL, 2015, 84–94

[115] R. Gupta, C. Orasan, J. van Genabith, “Reval: A simple and effective machine translation evaluation metric based on recurrent neural networks”, Proc. 2015 EMNLP (Lisbon, Portugal, September 2015), ACL, 2015, 1066–1072

[116] F. Guzmán, S. Joty, L. Marquez, P. Nakov, “Pairwise neural machine translation evaluation”, Proc. 53rd ACL and the 7th IJCNLP (Beijing, China, July 2015), v. 1, Long Papers, ACL, 805–814

[117] D. Hall, G. Durrett, D. Klein, “Less grammar, more features”, Proc. 52nd ACL (Baltimore, Maryland, June 2014), v. 1, Long Papers, ACL, 228–237

[118] A. L. F. Han, D. F. Wong, L. S. Chao, “LEPOR: A robust evaluation metric for machine translation with augmented factors”, Proc. COLING 2012, Posters (Mumbai, India, The COLING 2012 Organizing Committee, December 2012), 441–450 | Zbl

[119] S. J. Hanson, L. Y. Pratt, “Comparing biases for minimal network construction with back-propagation”, Advances in Neural Information Processing Systems, NIPS, v. 1, eds. D. S. Touretzky, Morgan Kaufmann, San Mateo, CA, 1989, 177–185

[120] K. He, X. Zhang, S. Ren, J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification”, Proc. ICCV 2015, 2015, 1026–1034 | Zbl

[121] K. He, X. Zhang, S. Ren, J. Sun, “Deep residual learning for image recognition”, Proc. 2016 CVPR, 2016, 770–778

[122] K. M. Hermann, P. Blunsom, “Multilingual models for compositional distributed semantics”, Proc. 52nd ACL (Baltimore, Maryland), v. 1, Long Papers, ACL, 2014, 58–68

[123] K. M. Hermann, T. Kočiský, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, P. Blunsom, Teaching machines to read and comprehend, 2015, arXiv: 1506.03340

[124] F. Hill, K. Cho, A. Korhonen, Learning distributed representations of sentences from unlabelled data, 2016, arXiv: 1602.03483

[125] G. E. Hinton, J. L. McClelland, “Learning representations by recirculation”, Neural Information Processing Systems, ed. D. Z. Anderson, American Institute of Physics, 1988, 358–366

[126] G. E. Hinton, S. Osindero, Y.-W. Teh, “A fast learning algorithm for deep belief nets”, Neural Computation, 18:7 (2006), 1527–1554 | DOI | MR | Zbl

[127] G. E. Hinton, R. S. Zemel, “Autoencoders, minimum description length and helmholtz free energy”, Advances in Neural Information Processing Systems, 6, eds. J. D. Cowan, G. Tesauro, J. Alspector, Morgan-Kaufmann, 1994, 3–10

[128] S. Hochreiter, Untersuchungen zu dynamischen neuronalen Netzen, Diploma thesis, Institut für Informatik, Lehrstuhl Prof. Brauer, Technische Universität München, 1991 (Advisor: J. Schmidhuber)

[129] S. Hochreiter, Y. Bengio, P. Frasconi, J. Schmidhuber, “Gradient flow in recurrent nets: the difficulty of learning long-term dependencies”, A Field Guide to Dynamical Recurrent Neural Networks, eds. S. C. Kremer, J. F. Kolen, IEEE Press, 2001

[130] S. Hochreiter, J. Schmidhuber, Long Short-Term Memory, Tech. Report FKI-207-95, Fakultät für Informatik, Technische Universität München, 1995

[131] S. Hochreiter, J. Schmidhuber, “Long Short-Term Memory”, Neural Computation, 9:8 (1997), 1735–1780 | DOI

[132] B. Hu, Z. Lu, H. Li, Q. Chen, “Convolutional neural network architectures for matching natural language sentences”, Advances in Neural Information Processing Systems, 27, eds. Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, K. Q. Weinberger, Curran Associates, Inc, 2014, 2042–2050

[133] E. H. Huang, R. Socher, C. D. Manning, A. Y. Ng, “Improving word representations via global context and multiple word prototypes”, ACL '12, ACL, 2012, 873–882

[134] E. H. Huang, R. Socher, C. D. Manning, A. Y. Ng, “Improving word representations via global context and multiple word prototypes”, Proc. 50th ACL, v. Long Papers, 1, ACL, 2012, 873–882

[135] P.-S. Huang, X. He, J. Gao, L. Deng, A. Acero, L. Heck, “Learning deep structured semantic models for web search using clickthrough data”, Proc. CIKM, 2013

[136] D. H. Hubel, T. N. Wiesel, “Receptive fields and functional architecture of monkey striate cortex”, Journal of Physiology (London), 195 (1968), 215–243 | DOI

[137] S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, 2015, arXiv: 1502.03167

[138] O. Irsoy, C. Cardie, “Opinion mining with deep recurrent neural networks”, Proc. EMNLP, 2014, 720–728

[139] M. Iyyer, J. Boyd-Graber, L. Claudino, R. Socher, H. Daumé III, “A neural network for factoid question answering over paragraphs”, Empirical Methods in Natural Language Processing, 2014

[140] K. Jarrett, K. Kavukcuoglu, M. Ranzato, Y. LeCun, “What is the best multi-stage architecture for object recognition?”, Proc. 12th ICCV, 2009, 2146–2153

[141] S. Jean, K. Cho, R. Memisevic, Y. Bengio, “On using very large target vocabulary for neural machine translation”, Proc. 53rd ACL and the 7th IJCNLP (Beijing, China), v. 1, Long Papers, ACL, 2015, 1–10

[142] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. B. Girshick, S. Guadarrama, T. Darrell, Caffe: Convolutional architecture for fast feature embedding, 2014, arXiv: 1408.5093

[143] M. Joshi, M. Dredze, W. W. Cohen, C. P. Rose, “What's in a domain? multi-domain learning for multi-attribute data”, Proc. Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Atlanta, Georgia), ACL, 2013, 685–690

[144] A. Joulin, T. Mikolov, Inferring algorithmic patterns with stack-augmented recurrent nets, 2015, arXiv: 1503.01007

[145] M. Kageback, O. Mogren, N. Tahmasebi, D. Dubhashi, “Extractive summarization using continuous vector space models”, Proc. 2nd Workshop on Continuous Vector Space Models and their Compositionality (CVSC) at EACL, 2014, 31–39 | DOI

[146] L. Kaiser, I. Sutskever, Neural gpus learn algorithms, 2015, arXiv: 1511.08228

[147] N. Kalchbrenner, P. Blunsom, “Recurrent continuous translation models”, EMNLP, v. 3, 2013, 413

[148] N. Kalchbrenner, P. Blunsom, Recurrent convolutional neural networks for discourse compositionality, 2013, arXiv: 1306.3584

[149] N. Kalchbrenner, E. Grefenstette, P. Blunsom, A convolutional neural network for modelling sentences, 2014, arXiv: 1404.2188

[150] N. Kalchbrenner, E. Grefenstette, P. Blunsom, “A convolutional neural network for modelling sentences”, Proc. 52nd ACL (Baltimore, Maryland), v. 1, Long Papers, ACL, 2014, 655–665

[151] A. Karpathy, The unreasonable effectiveness of recurrent neural networks, 2015

[152] D. Kartsaklis, M. Sadrzadeh, S. Pulman, “A unified sentence space for categorical distributional-compositional semantics: Theory and experiments”, Proc. 24th International Conference on Computational Linguistics, COLING, Posters (Mumbai, India), 2012, 549–558

[153] T. Kenter, M. de Rijke, “Short text similarity with word embeddings”, CIKM '15, ACM, 2015, 1411–1420 | DOI

[154] Y. Kim, “Convolutional neural networks for sentence classification”, Proc. 2014 EMNLP (Doha, Qatar, October 2014), ACL, 2014, 1746–1751

[155] Y. Kim, Y. Jernite, D. Sontag, A. M. Rush, Character-aware neural language models, 2015, arXiv: 1508.06615

[156] D. Kingma, J. Ba, Adam: A method for stochastic optimization, 2014, arXiv: 1412.6980 | Zbl

[157] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, 2014, arXiv: 1412.6980 | Zbl

[158] D. P. Kingma, T. Salimans, M. Welling, “Variational dropout and the local reparameterization trick”, Advances in Neural Information Processing Systems, 28, eds. C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, R. Garnett, Curran Associates, Inc, 2015, 2575–2583

[159] S. Kiritchenko, X. Zhu, S. M. Mohammad, “Sentiment analysis of short informal texts”, Journal of Artificial Intelligence Research, 2014, 723–762 | DOI

[160] R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, S. Fidler, “Skip-thought vectors”, Advances in Neural Information Processing Systems, 28, eds. C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, R. Garnett, Curran Associates, Inc, 2015, 3294–3302

[161] R. Kneser, H. Ney, “Improved backing-off for m-gram language modeling”, Proc. ICASSP-95, v. 1, 1995, 181–184

[162] P. Koehn, Statistical machine translation, 1st ed., Cambridge University Press, New York, NY, USA, 2010 | Zbl

[163] O. Kolomiyets, M.-F. Moens, “A survey on question answering technology from an information retrieval perspective”, Inf. Sci., 181:24 (2011), 5412–5434 | DOI | MR

[164] A. Krogh, J. A. Hertz, “A simple weight decay can improve generalization”, Advances in Neural Information Processing Systems, 4, eds. D. S. Lippman, J. E. Moody, D. S. Touretzky, Morgan Kaufmann, 1992, 950–957

[165] A. Kumar, O. Irsoy, J. Su, J. Bradbury, R. English, B. Pierce, P. Ondruska, I. Gulrajani, R. Socher, Ask me anything: Dynamic memory networks for natural language processing, 2015, arXiv: 1506.07285

[166] J. Lafferty, A. McCallum, F. C. Pereira, “Conditional random fields: Probabilistic models for segmenting and labeling sequence data”, Proc. 18th ICML, 2001, 282–289

[167] T. K. Landauer, S. T. Dumais, “A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge”, Psychological Review, 104:2 (1997), 211–240 | DOI

[168] H. Larochelle, D. Erhan, A. Courville, J. Bergstra, Y. Bengio, “An empirical evaluation of deep architectures on problems with many factors of variation”, ICML '07, ACM, 2007, 473–480 | DOI

[169] H. Larochelle, G. E. Hinton, “Learning to combine foveal glimpses with a third-order boltzmann machine”, Advances in Neural Information Processing Systems, 23, eds. J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, A. Culotta, Curran Associates, Inc, 2010, 1243–1251

[170] A. Lavie, K. Sagae, S. Jayaraman, “The Significance of Recall in Automatic Metrics for MT Evaluation”, Proceedings of the 6th Conference of the Association for Machine Translation in the Americas, Technical Papers, Springer, Berlin–Heidelberg, 2004, 134–143

[171] Q. V. Le, N. Jaitly, G. E. Hinton, A simple way to initialize recurrent networks of rectified linear units, 2015, arXiv: 1504.00941

[172] Q. V. Le, T. Mikolov, Distributed representations of sentences and documents, 2014, arXiv: 1405.4053

[173] Y. LeCun, “Une procédure d'apprentissage pour réseau à seuil asymétrique”, Proc. Cognitiva 85 (Paris, 1985), 599–604

[174] Y. LeCun, Modèles connexionnistes de l'apprentissage (connectionist learning models), Ph.D. thesis, Université P. et M. Curie, Paris 6, 1987

[175] Y. LeCun, “A theoretical framework for back-propagation”, Proc. 1988 Connectionist Models Summer School (CMU, Pittsburgh, Pa), eds. D. Touretzky, G. Hinton, T. Sejnowski, Morgan Kaufmann, 1988, 21–28

[176] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, “Gradient-based learning applied to document recognition”, Intelligent Signal Processing, IEEE Press, 2001, 306–351

[177] Y. LeCun, F. Fogelman-Soulié, “Modèles connexionnistes de l'apprentissage”, Intellectica, 1987, special issue “Apprentissage et machine”

[178] Y. LeCun, Y. Bengio, G. Hinton, “Deep learning”, Nature, 521 (2015), 436–444 | DOI

[179] Y. LeCun, K. Kavukcuoglu, C. Farabet, “Convolutional networks and applications in vision”, Proc. ISCAS 2010, 2010, 253–256

[180] O. Levy, Y. Goldberg, “Linguistic regularities in sparse and explicit word representations”, CoNLL, 2014, 171–180

[181] J. Li, W. Monroe, A. Ritter, D. Jurafsky, M. Galley, J. Gao, “Deep reinforcement learning for dialogue generation”, Proc. 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016 (Austin, Texas, USA, November 1–4, 2016), 2016, 1192–1202

[182] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, D. Wierstra, Continuous control with deep reinforcement learning, 2015, arXiv: 1509.02971

[183] C. Lin, Y. He, R. Everson, S. Rüger, “Weakly supervised joint sentiment-topic detection from text”, IEEE Transactions on Knowledge and Data Engineering, 24:6 (2012), 1134–1145 | DOI

[184] C.-Y. Lin, F. J. Och, “Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics”, ACL '04, ACL, 2004

[185] Z. Lin, W. Wang, X. Jin, J. Liang, D. Meng, “A word vector and matrix factorization based method for opinion lexicon extraction”, WWW '15 Companion, ACM, 2015, 67–68 | DOI

[186] W. Ling, C. Dyer, A. W. Black, I. Trancoso, R. Fermandez, S. Amir, L. Marujo, T. Luis, “Finding function in form: Compositional character models for open vocabulary word representation”, Proc. EMNLP 2015 (Lisbon, Portugal), ACL, 2015, 1520–1530

[187] S. Linnainmaa, The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors, Master's thesis, Univ. Helsinki, 1970

[188] B. Liu, Sentiment analysis and opinion mining, Synthesis Lectures on Human Language Technologies, 5, Morgan & Claypool Publishers, 2012 | DOI

[189] B. Liu, Sentiment analysis: mining opinions, sentiments, and emotions, Cambridge University Press, 2015

[190] B. Liu, Sentiment analysis: Mining opinions, sentiments, and emotions, Cambridge University Press, 2015

[191] C. Liu, R. Lowe, I. Serban, M. Noseworthy, L. Charlin, J. Pineau, “How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation”, Proc. EMNLP 2016, 2016, 2122–2132

[192] P. Liu, X. Qiu, X. Huang, “Learning context-sensitive word embeddings with neural tensor skip-gram model”, IJCAI'15, AAAI Press, 2015, 1284–1290

[193] Y. Liu, Z. Liu, T.-S. Chua, M. Sun, “Topical word embeddings”, AAAI'15, AAAI Press, 2015, 2418–2424

[194] A. Lopez, “Statistical machine translation”, ACM Comput. Surv., 40:3 (2008), 8:1–8:49 | DOI

[195] R. Lowe, M. Noseworthy, I. V. Serban, N. Angelard-Gontier, Y. Bengio, J. Pineau, Towards an automatic Turing test: Learning to evaluate dialogue responses, Submitted to ICLR 2017, 2017

[196] R. Lowe, N. Pow, I. Serban, J. Pineau, The Ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems, 2015, arXiv: 1506.08909

[197] Q. Luo, W. Xu, “Learning word vectors efficiently using shared representations and document representations”, AAAI'15, AAAI Press, 2015, 4180–4181

[198] Q. Luo, W. Xu, J. Guo, “A study on the CBOW model's overfitting and stability”, Web-KR '14, ACM, 2014, 9–12

[199] M.-T. Luong, M. Kayser, C. D. Manning, “Deep neural language models for machine translation”, Proc. Conference on Natural Language Learning, CoNLL (Beijing, China), ACL, 2015, 305–309

[200] M.-T. Luong, R. Socher, C. D. Manning, “Better word representations with recursive neural networks for morphology”, CoNLL (Sofia, Bulgaria), 2013

[201] T. Luong, H. Pham, C. D. Manning, “Effective approaches to attention-based neural machine translation”, Proc. EMNLP (Lisbon, Portugal, September 2015), ACL, 2015, 1412–1421

[202] T. Luong, I. Sutskever, Q. Le, O. Vinyals, W. Zaremba, “Addressing the rare word problem in neural machine translation”, Proc. 53rd ACL and the 7th IJCNLP (Beijing, China), v. 1, Long Papers, ACL, 2015, 11–19

[203] M. Ma, L. Huang, B. Xiang, B. Zhou, “Dependency-based convolutional neural networks for sentence embedding”, Proc. ACL 2015, v. 2, Short Papers, 2015, 174 | Zbl

[204] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, C. Potts, “Learning word vectors for sentiment analysis”, HLT '11, ACL, 2011, 142–150

[205] B. MacCartney, C. D. Manning, “An extended model of natural logic”, Proc. Eighth International Conference on Computational Semantics (Tilburg, The Netherlands, January 2009), ACL, 2009, 140–156

[206] D. J. MacKay, Information theory, inference and learning algorithms, Cambridge University Press, 2003 | Zbl

[207] C. D. Manning, “Computational linguistics and deep learning”, Computational Linguistics, 41:4 (2015), 701–707

[208] C. D. Manning, P. Raghavan, H. Schütze, Introduction to information retrieval, Cambridge University Press, 2008 | Zbl

[209] M. Marelli, L. Bentivogli, M. Baroni, R. Bernardi, S. Menini, R. Zamparelli, “Evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment”, SemEval–2014, v. 1, 2014 | Zbl

[210] B. Marie, A. Max, “Multi-pass decoding with complex feature guidance for statistical machine translation”, Proc. 53rd ACL and the 7th IJCNLP (Beijing, China, July 2015), v. 2, Short Papers, ACL, 2015, 554–559

[211] W. McCulloch, W. Pitts, “A logical calculus of the ideas immanent in nervous activity”, Bulletin of Mathematical Biophysics, 5 (1943), 115–133 | DOI

[212] F. Meng, Z. Lu, M. Wang, H. Li, W. Jiang, Q. Liu, “Encoding source language with convolutional neural network for machine translation”, Proc. 53rd ACL and the 7th IJCNLP (Beijing, China), v. 1, Long Papers, ACL, 2015, 20–30

[213] T. Mikolov, Statistical language models based on neural networks, Ph.D. thesis, Brno University of Technology, 2012

[214] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, 2013, arXiv: 1301.3781

[215] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, S. Khudanpur, “Recurrent neural network based language model”, INTERSPEECH, v. 2, 2010, 1045–1048

[216] T. Mikolov, S. Kombrink, L. Burget, J. H. Černocký, S. Khudanpur, “Extensions of recurrent neural network language model”, IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, IEEE, 2011, 5528–5531

[217] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, 2013, arXiv: 1310.4546

[218] J. Mitchell, M. Lapata, “Composition in distributional models of semantics”, Cognitive Science, 34:8 (2010), 1388–1429 | DOI

[219] J. Mitchell, M. Lapata, “Composition in distributional models of semantics”, Cognitive Science, 34:8 (2010), 1388–1429 | DOI

[220] A. Mnih, G. E. Hinton, “A scalable hierarchical distributed language model”, Advances in neural information processing systems, 2009, 1081–1088

[221] A. Mnih, K. Kavukcuoglu, “Learning word embeddings efficiently with noise-contrastive estimation”, Advances in Neural Information Processing Systems, 26, eds. C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, K. Q. Weinberger, Curran Associates, Inc., 2013, 2265–2273

[222] V. Mnih, N. Heess, A. Graves, K. Kavukcuoglu, “Recurrent models of visual attention”, Advances in Neural Information Processing Systems, 27, eds. Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, K. Q. Weinberger, Curran Associates, Inc., 2014, 2204–2212

[223] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, M. Riedmiller, “Playing Atari with deep reinforcement learning”, NIPS Deep Learning Workshop (2013) | Zbl

[224] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, D. Hassabis, “Human-level control through deep reinforcement learning”, Nature, 518:7540 (2015), 529–533 | DOI

[225] G. Montavon, G. B. Orr, K.-R. Müller (eds.), Neural networks: Tricks of the trade, Lecture Notes in Computer Science, 7700, second edition, Springer, 2012 | DOI

[226] L. Morgenstern, C. L. Ortiz, “The Winograd schema challenge: Evaluating progress in commonsense reasoning”, AAAI'15, AAAI Press, 2015, 4024–4025

[227] K. P. Murphy, Machine learning: a probabilistic perspective, MIT Press, 2012

[228] A. Neelakantan, B. Roth, A. McCallum, “Compositional vector space models for knowledge base completion”, Proc. 53rd ACL and the 7th IJCNLP (Beijing, China), v. 1, Long Papers, ACL, 2015, 156–166

[229] V. Ng, C. Cardie, “Improving machine learning approaches to coreference resolution”, ACL '02, ACL, 2002, 104–111

[230] Y. Oda, G. Neubig, S. Sakti, T. Toda, S. Nakamura, “Syntax-based simultaneous translation through prediction of unseen syntactic constituents”, Proc. 53rd ACL and the 7th IJCNLP (Beijing, China), v. 1, Long Papers, ACL, 2015, 198–207

[231] M. Osborne, S. Moran, R. McCreadie, A. Von Lunen, M. Sykora, E. Cano, N. Ireson, C. Macdonald, I. Ounis, Y. He, T. Jackson, F. Ciravegna, A. O'Brien, “Real-time detection, tracking, and monitoring of automatically discovered events in social media”, Proc. 52nd ACL: System Demonstrations (Baltimore, Maryland, June 2014), ACL, 37–42 | MR

[232] B. Pang, L. Lee, “Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales”, ACL '05, ACL, 2005, 115–124 | DOI

[233] B. Pang, L. Lee, “Opinion mining and sentiment analysis”, Foundations and trends in information retrieval, 2:1–2 (2008), 1–135 | DOI | MR

[234] P. Pantel, “Inducing ontological co-occurrence vectors”, ACL '05, ACL, 2005, 125–132 | DOI

[235] D. Paperno, N. T. Pham, M. Baroni, “A practical and linguistically-motivated approach to compositional distributional semantics”, Proc. 52nd ACL (Baltimore, Maryland), v. 1, Long Papers, ACL, 2014, 90–99

[236] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation”, Proc. 40th ACL, ACL, 2002, 311–318

[237] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, “Bleu: A method for automatic evaluation of machine translation”, ACL '02, ACL, 2002, 311–318

[238] D. B. Parker, Learning-logic, Tech. Report TR-47, Center for Comp. Research in Economics and Management Sci., MIT, 1985

[239] R. Pascanu, Ç. Gülçehre, K. Cho, Y. Bengio, How to construct deep recurrent neural networks, 2013, arXiv: 1312.6026

[240] Y. Peng, S. Wang, B.-L. Lu, “Marginalized Denoising Autoencoder via Graph Regularization for Domain Adaptation”, Lecture Notes in Computer Science, 8227, Springer, Berlin–Heidelberg, 2013, 156–163 | DOI

[241] J. Pennington, R. Socher, C. Manning, “GloVe: Global vectors for word representation”, Proc. 2014 EMNLP (Doha, Qatar), ACL, 2014, 1532–1543

[242] J. Pouget-Abadie, D. Bahdanau, B. van Merriënboer, K. Cho, Y. Bengio, Overcoming the curse of sentence length for neural machine translation using automatic segmentation, 2014, arXiv: 1409.1257

[243] L. Prechelt, “Early Stopping — But When?”, Neural Networks: Tricks of the Trade, Springer, Berlin–Heidelberg, 2012, 53–67 | DOI

[244] J. Preiss, M. Stevenson, “Unsupervised domain tuning to improve word sense disambiguation”, Proc. 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Atlanta, Georgia), ACL, 2013, 680–684

[245] S. Prince, Computer vision: Models, learning, and inference, Cambridge University Press, 2012 | Zbl

[246] A. Ramesh, S. H. Kumar, J. Foulds, L. Getoor, “Weakly supervised models of aspect-sentiment for online course discussion forums”, Proc. 53rd ACL and the 7th IJCNLP (Beijing, China), v. 1, Long Papers, ACL, 2015, 74–83

[247] R. S. Randhawa, P. Jain, G. Madan, Topic modeling using distributed word embeddings, 2016, arXiv: 1603.04747 | Zbl

[248] M. Ranzato, G. E. Hinton, Y. LeCun, “Guest editorial: Deep learning”, International Journal of Computer Vision, 113:1 (2015), 1–2 | DOI | MR

[249] J. Reisinger, R. J. Mooney, “Multi-prototype vector-space models of word meaning”, HLT'10, ACL, 2010, 109–117

[250] X. Rong, word2vec parameter learning explained, 2014, arXiv: 1411.2738

[251] F. Rosenblatt, Principles of neurodynamics, Spartan, New York, 1962 | Zbl

[252] F. Rosenblatt, “The perceptron: A probabilistic model for information storage and organization in the brain”, Psychological Review, 65:6 (1958), 386–408 | DOI

[253] H. Rubenstein, J. B. Goodenough, “Contextual correlates of synonymy”, Communications of the ACM, 8:10 (1965), 627–633 | DOI

[254] A. A. Rusu, S. G. Colmenarejo, Ç. Gülçehre, G. Desjardins, J. Kirkpatrick, R. Pascanu, V. Mnih, K. Kavukcuoglu, R. Hadsell, Policy distillation, 2015, arXiv: 1511.06295

[255] M. Sachan, K. Dubey, E. Xing, M. Richardson, “Learning answer-entailing structures for machine comprehension”, Proc. 53rd ACL and the 7th IJCNLP (Beijing, China), v. 1, Long Papers, ACL, 2015, 239–249

[256] M. Sadrzadeh, E. Grefenstette, “A compositional distributional semantics, two concrete constructions, and some experimental evaluations”, QI'11, Springer-Verlag, 2011, 35–47

[257] M. Sahlgren, “The Distributional Hypothesis”, Italian Journal of Linguistics, 20:1 (2008), 33–54

[258] R. Salakhutdinov, “Learning Deep Generative Models”, Annual Review of Statistics and Its Application, 2:1 (2015), 361–385 | DOI

[259] R. Salakhutdinov, G. Hinton, “An efficient learning procedure for deep Boltzmann machines”, Neural Computation, 24:8 (2012), 1967–2006 | DOI | MR | Zbl

[260] R. Salakhutdinov, G. E. Hinton, “Deep Boltzmann machines”, Proc. Twelfth International Conference on Artificial Intelligence and Statistics, AISTATS (Clearwater Beach, Florida, USA, April 16–18, 2009), 2009, 448–455

[261] J. Schmidhuber, “Deep learning in neural networks: An overview”, Neural Networks, 61 (2015), 85–117 | DOI

[262] M. Schuster, On supervised learning from sequential data with applications for speech recognition, Ph.D. thesis, Nara Institute of Science and Technology, Kyoto, Japan, 1999

[263] M. Schuster, K. K. Paliwal, “Bidirectional recurrent neural networks”, IEEE Transactions on Signal Processing, 45:11 (1997), 2673–2681 | DOI

[264] H. Schwenk, “Continuous space language models”, Comput. Speech Lang., 21:3 (2007), 492–518 | DOI

[265] I. V. Serban, A. G. Ororbia II, J. Pineau, A. C. Courville, Multi-modal variational encoder-decoders, 2016

[266] I. V. Serban, A. Sordoni, Y. Bengio, A. C. Courville, J. Pineau, Hierarchical neural network generative models for movie dialogues, 2015, arXiv: 1507.04808

[267] I. V. Serban, A. Sordoni, R. Lowe, L. Charlin, J. Pineau, A. C. Courville, Y. Bengio, “A hierarchical latent variable encoder-decoder model for generating dialogues”, Proc. 31st AAAI, 2017, 3295–3301

[268] H. Setiawan, Z. Huang, J. Devlin, T. Lamar, R. Zbib, R. Schwartz, J. Makhoul, “Statistical machine translation features with multitask tensor networks”, Proc. 53rd ACL and the 7th IJCNLP (Beijing, China), v. 1, Long Papers, ACL, 2015, 31–41

[269] A. Severyn, A. Moschitti, “Learning to rank short text pairs with convolutional deep neural networks”, SIGIR'15, ACM, 2015, 373–382

[270] K. Shah, R. W. M. Ng, F. Bougares, L. Specia, “Investigating continuous space language models for machine translation quality estimation”, Proc. EMNLP (Lisbon, Portugal, September 2015), ACL, 2015, 1073–1078

[271] L. Shang, Z. Lu, H. Li, “Neural responding machine for short-text conversation”, Proc. 53rd ACL and the 7th IJCNLP (Beijing, China, July 2015), v. 1, Long Papers, ACL, 2015, 1577–1586

[272] Y. Shen, X. He, J. Gao, L. Deng, G. Mesnil, “A latent semantic model with convolutional-pooling structure for information retrieval”, CIKM '14, ACM, 2014, 101–110 | DOI

[273] C. Silberer, M. Lapata, “Learning grounded meaning representations with autoencoders”, ACL, 1 (2014), 721–732

[274] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, D. Hassabis, “Mastering the Game of Go with Deep Neural Networks and Tree Search”, Nature, 529:7587 (2016), 484–489 | DOI

[275] M. Snover, B. Dorr, R. Schwartz, L. Micciulla, J. Makhoul, “A study of translation edit rate with targeted human annotation”, Proc. Association for Machine Translation in the Americas, 2006, 223–231

[276] R. Snow, S. Prakash, D. Jurafsky, A. Y. Ng, “Learning to Merge Word Senses”, Proc. Joint Meeting of the Conference on Empirical Methods on Natural Language Processing and the Conference on Natural Language Learning, 2007, 1005–1014

[277] R. Socher, J. Bauer, C. D. Manning, A. Y. Ng, “Parsing with compositional vector grammars”, Proc. ACL, 2013, 455–465

[278] R. Socher, D. Chen, C. D. Manning, A. Ng, “Reasoning with neural tensor networks for knowledge base completion”, Advances in Neural Information Processing Systems, NIPS, 2013

[279] R. Socher, E. H. Huang, J. Pennington, C. D. Manning, A. Y. Ng, “Dynamic pooling and unfolding recursive autoencoders for paraphrase detection”, Advances in Neural Information Processing Systems, 2011, 801–809

[280] R. Socher, A. Karpathy, Q. Le, C. Manning, A. Ng, “Grounded compositional semantics for finding and describing images with sentences”, Transactions of the Association for Computational Linguistics, 2014 (2014)

[281] R. Socher, J. Pennington, E. H. Huang, A. Y. Ng, C. D. Manning, “Semi-supervised recursive autoencoders for predicting sentiment distributions”, Proc. EMNLP 2011, ACL, 2011, 151–161

[282] R. Socher, A. Perelygin, J. Y. Wu, J. Chuang, C. D. Manning, A. Y. Ng, C. Potts, “Recursive deep models for semantic compositionality over a sentiment treebank”, Proc. EMNLP 2013, 2013, 1631–1642

[283] Y. Song, H. Wang, X. He, “Adapting deep ranknet for personalized search”, WSDM 2014, ACM, 2014

[284] A. Sordoni, Y. Bengio, H. Vahabi, C. Lioma, J. Grue Simonsen, J.-Y. Nie, “A hierarchical recurrent encoder-decoder for generative context-aware query suggestion”, CIKM '15, ACM, 2015, 553–562 | DOI

[285] R. Soricut, F. Och, “Unsupervised morphology induction using word embeddings”, Proc. 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Denver, Colorado), ACL, 2015, 1627–1637 | DOI

[286] B. Speelpenning, Compiling fast partial derivatives of functions given by algorithms, Ph.D. thesis, Department of Computer Science, University of Illinois, Urbana-Champaign, 1980

[287] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting”, Journal of Machine Learning Research, 15:1 (2014), 1929–1958 | Zbl

[288] R. K. Srivastava, K. Greff, J. Schmidhuber, “Training very deep networks”, NIPS'15, MIT Press, 2015, 2377–2385

[289] P. Stenetorp, “Transition-based dependency parsing using recursive neural networks”, Deep Learning Workshop at NIPS 2013

[290] J. Su, D. Xiong, Y. Liu, X. Han, H. Lin, J. Yao, M. Zhang, “A context-aware topic model for statistical machine translation”, Proc. 53rd ACL and the 7th IJCNLP (Beijing, China), v. 1, Long Papers, ACL, 2015, 229–238

[291] P.-H. Su, M. Gasic, N. Mrkšić, L. M. Rojas-Barahona, S. Ultes, D. Vandyke, T.-H. Wen, S. Young, “On-line active reward learning for policy optimisation in spoken dialogue systems”, Proc. 54th ACL (Berlin, Germany, August 2016), v. 1, Long Papers, ACL, 2016, 2431–2441

[292] S. Sukhbaatar, A. Szlam, J. Weston, R. Fergus, Weakly supervised memory networks, 2015

[293] F. Sun, J. Guo, Y. Lan, J. Xu, X. Cheng, “Learning word representations by jointly modeling syntagmatic and paradigmatic relations”, Proc. 53rd ACL and the 7th IJCNLP (Beijing, China), v. 1, Long Papers, ACL, 2015, 136–145

[294] I. Sutskever, G. E. Hinton, “Deep, narrow sigmoid belief networks are universal approximators”, Neural Computation, 20:11 (2008), 2629–2636 | DOI | Zbl

[295] I. Sutskever, J. Martens, G. Hinton, “Generating text with recurrent neural networks”, ICML'11, ACM, 2011, 1017–1024

[296] I. Sutskever, O. Vinyals, Q. V. Le, Sequence to sequence learning with neural networks, 2014, arXiv: 1409.3215 | Zbl

[297] Y. Tagami, H. Kobayashi, S. Ono, A. Tajima, “Modeling user activities on the web using paragraph vector”, WWW '15 Companion, ACM, 2015, 125–126 | DOI

[298] K. S. Tai, R. Socher, C. D. Manning, “Improved semantic representations from tree-structured long short-term memory networks”, Proc. 53rd ACL and 7th IJCNLP, v. 1, 2015, 1556–1566

[299] Y. Taigman, M. Yang, M. Ranzato, L. Wolf, “DeepFace: Closing the gap to human-level performance in face verification”, CVPR '14, IEEE Computer Society, 2014, 1701–1708

[300] D. Tang, F. Wei, N. Yang, M. Zhou, T. Liu, B. Qin, “Learning sentiment-specific word embedding for twitter sentiment classification”, ACL, 1 (2014), 1555–1565

[301] W. T. Yih, X. He, C. Meek, “Semantic parsing for single-relation question answering”, Proc. ACL, ACL, 2014

[302] J. Tiedemann, “News from OPUS - A Collection of Multilingual Parallel Corpora with Tools and Interfaces”, Recent Advances in Natural Language Processing (Amsterdam/Philadelphia), v. V, eds. N. Nicolov, K. Bontcheva, G. Angelova, R. Mitkov, John Benjamins, Amsterdam–Philadelphia, 2009, 237–248 | DOI

[303] I. Titov, J. Henderson, “A latent variable model for generative dependency parsing”, IWPT '07, ACL, 2007, 144–155 | DOI

[304] E. F. Tjong Kim Sang, S. Buchholz, “Introduction to the CoNLL-2000 shared task: Chunking”, CoNLL '00, ACL, 2000, 127–132

[305] T. Zhang, B. Yu, “Boosting with early stopping: Convergence and consistency”, The Annals of Statistics, 33:4 (2005), 1538–1579 | MR | Zbl

[306] K. Toutanova, D. Klein, C. D. Manning, Y. Singer, “Feature-rich part-of-speech tagging with a cyclic dependency network”, NAACL '03, ACL, 2003, 173–180 | DOI

[307] Y. Tsuboi, H. Ouchi, Neural dialog models: A survey, 2015 http://2boy.org/~yuta/publications/neural-dialog-models-survey-20150906.pdf

[308] J. Turian, L. Ratinov, Y. Bengio, “Word representations: A simple and general method for semi-supervised learning”, ACL '10, ACL, 2010, 384–394

[309] P. D. Turney, P. Pantel, et al., “From frequency to meaning: Vector space models of semantics”, Journal of artificial intelligence research, 37:1 (2010), 141–188 | DOI | MR | Zbl

[310] E. Tutubalina, S. I. Nikolenko, “Constructing aspect-based sentiment lexicons with topic modeling”, Proc. 5th International Conference on Analysis of Images, Social Networks, and Texts, AIST 2016, 2016 (to appear)

[311] B. van Merriënboer, D. Bahdanau, V. Dumoulin, D. Serdyuk, D. Warde-Farley, J. Chorowski, Y. Bengio, Blocks and fuel: Frameworks for deep learning, 2015, arXiv: 1506.00619

[312] D. Venugopal, C. Chen, V. Gogate, V. Ng, “Relieving the computational bottleneck: Joint inference for event extraction with high-dimensional features”, Proc. EMNLP (Doha, Qatar), ACL, 2014, 831–843

[313] P. Vincent, “A connection between score matching and denoising autoencoders”, Neural Computation, 23:7 (2011), 1661–1674 | DOI | MR | Zbl

[314] P. Vincent, H. Larochelle, Y. Bengio, P.-A. Manzagol, “Extracting and composing robust features with denoising autoencoders”, ICML '08, ACM, 2008, 1096–1103 | DOI

[315] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, P.-A. Manzagol, “Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion”, Journal of Machine Learning Research, 11 (2010), 3371–3408 | Zbl

[316] O. Vinyals, L. Kaiser, T. Koo, S. Petrov, I. Sutskever, G. E. Hinton, Grammar as a foreign language, 2014, arXiv: 1412.7449 | Zbl

[317] O. Vinyals, Q. V. Le, “A neural conversational model”, ICML Deep Learning Workshop, 2015, arXiv: 1506.05869 | Zbl

[318] V. Viswanathan, N. F. Rajani, Y. Bentor, R. Mooney, “Stacked ensembles of information extractors for knowledge-base population”, Proc. 53rd ACL and the 7th IJCNLP (Beijing, China), v. 1, Long Papers, ACL, 2015, 177–187

[319] X. Wang, Y. Liu, C. Sun, B. Wang, X. Wang, “Predicting polarities of tweets by composing word embeddings with long short-term memory”, Proc. 53rd ACL and the 7th IJCNLP (Beijing, China), v. 1, Long Papers, ACL, 2015, 1343–1353

[320] D. Weiss, C. Alberti, M. Collins, S. Petrov, “Structured training for neural network transition-based parsing”, Proc. 53rd ACL and the 7th IJCNLP (Beijing, China), v. 1, Long Papers, ACL, 2015, 323–333

[321] J. Weizenbaum, “Eliza — a computer program for the study of natural language communication between man and machine”, Communications of the ACM, 9:1 (1966), 36–45 | DOI

[322] T. Wen, M. Gasic, N. Mrkšić, L. M. Rojas-Barahona, P. Su, S. Ultes, D. Vandyke, S. J. Young, “Conditional generation and snapshot learning in neural dialogue systems”, Proc. Conference on Empirical Methods in Natural Language Processing, EMNLP 2016 (Austin, Texas, USA, November 1–4, 2016), 2016, 2153–2162 | DOI

[323] P. J. Werbos, “Applications of advances in nonlinear sensitivity analysis”, Proc. 10th IFIP Conference, 1981, 762–770

[324] P. J. Werbos, “Backpropagation through time: what it does and how to do it”, Proc. IEEE, 78:10 (1990), 1550–1560 | DOI

[325] P. J. Werbos, “Backwards differentiation in AD and neural nets: Past links and new opportunities”, Automatic Differentiation: Applications, Theory, and Implementations, Springer, 2006, 15–34 | DOI | MR | Zbl

[326] J. Weston, A. Bordes, S. Chopra, T. Mikolov, Towards ai-complete question answering: A set of prerequisite toy tasks, 2015, arXiv: 1502.05698

[327] J. Weston, S. Chopra, A. Bordes, Memory networks, 2014, arXiv: 1410.3916

[328] L. White, R. Togneri, W. Liu, M. Bennamoun, “How well sentence embeddings capture meaning”, ADCS '15, ACM, 2015, 9:1–9:8

[329] R. J. Williams, D. Zipser, “Gradient-based learning algorithms for recurrent networks and their computational complexity”, Backpropagation (Hillsdale, NJ, USA), eds. Y. Chauvin, D. E. Rumelhart, L. Erlbaum Associates Inc., Hillsdale, NJ, USA, 1995, 433–486

[330] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, L. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Corrado, M. Hughes, J. Dean, Google's neural machine translation system: Bridging the gap between human and machine translation, 2016, arXiv: 1609.08144

[331] Z. Wu, C. L. Giles, “Sense-aware semantic analysis: A multi-prototype word representation model using wikipedia”, AAAI'15, AAAI Press, 2015, 2188–2194

[332] S. Wubben, A. van den Bosch, E. Krahmer, “Paraphrase generation as monolingual translation: Data and evaluation”, INLG '10, ACL, 2010, 203–207

[333] C. Xu, Y. Bai, J. Bian, B. Gao, G. Wang, X. Liu, T.-Y. Liu, “Rc-net: A general framework for incorporating knowledge into word representations”, CIKM '14, ACM, 2014, 1219–1228

[334] K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, Y. Bengio, Show, attend and tell: Neural image caption generation with visual attention, 2015, arXiv: 1502.03044

[335] R. Xu, D. Wunsch, Clustering, Wiley-IEEE Press, 2008

[336] X. Xue, J. Jeon, W. B. Croft, “Retrieval models for question and answer archives”, SIGIR '08, ACM, 2008, 475–482 | DOI

[337] M. Yang, T. Cui, W. Tu, Ordering-sensitive and semantic-aware topic modeling, 2015, arXiv: 1502.03630

[338] Y. Yang, J. Eisenstein, “Unsupervised multi-domain adaptation with feature embeddings”, Proc. 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Denver, Colorado), ACL, 2015, 672–682 | DOI

[339] Z. Yang, X. He, J. Gao, L. Deng, A. J. Smola, Stacked attention networks for image question answering, 2015, arXiv: 1511.02274

[340] K. Yao, G. Zweig, B. Peng, Attention with intention for a neural network conversation model, 2015, arXiv: 1510.08565

[341] X. Yao, J. Berant, B. Van Durme, “Freebase QA: Information extraction or semantic parsing?”, Proc. ACL Workshop on Semantic Parsing (Baltimore, MD, June 2014), ACL, 2014, 82–86

[342] Y. Yao, L. Rosasco, A. Caponnetto, “On early stopping in gradient descent learning”, Constructive Approximation, 26:2 (2007), 289–315 | DOI | MR | Zbl

[343] W.-t. Yih, M.-W. Chang, C. Meek, A. Pastusiak, “Question answering using enhanced lexical semantic models”, Proc. 51st ACL (Sofia, Bulgaria, August 2013), v. 1, Long Papers, ACL, 2013, 1744–1753

[344] W.-t. Yih, G. Zweig, J. C. Platt, “Polarity inducing latent semantic analysis”, EMNLP-CoNLL '12, ACL, 2012, 1212–1222

[345] W. Yin, H. Schütze, “MultiGranCNN: An architecture for general matching of text chunks on multiple levels of granularity”, Proc. 53rd ACL and the 7th IJCNLP (Beijing, China), v. 1, Long Papers, ACL, 2015, 63–73

[346] W. Yin, H. Schütze, B. Xiang, B. Zhou, ABCNN: attention-based convolutional neural network for modeling sentence pairs, 2015, arXiv: 1512.05193

[347] Y. Jo, A. H. Oh, “Aspect and sentiment unification model for online review analysis”, WSDM '11, ACM, 2011, 815–824

[348] Z. Yang, A. Kotov, A. Mohan, S. Lu, “Parametric and non-parametric user-aware sentiment topic models”, Proc. 38th ACM SIGIR, 2015

[349] W. Zaremba, I. Sutskever, Reinforcement learning neural Turing machines, 2015, arXiv: 1505.00521

[350] W. Zaremba, I. Sutskever, O. Vinyals, Recurrent neural network regularization, 2014, arXiv: 1409.2329 | Zbl

[351] M. D. Zeiler, ADADELTA: an adaptive learning rate method, 2012, arXiv: 1212.5701 | Zbl

[352] L. S. Zettlemoyer, M. Collins, Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars, 2012, arXiv: 1207.1420

[353] X. Zhang, Y. LeCun, Text understanding from scratch, 2015, arXiv: 1502.01710

[354] X. Zhang, J. Zhao, Y. LeCun, “Character-level convolutional networks for text classification”, Advances in Neural Information Processing Systems, 28, eds. C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, R. Garnett, Curran Associates, Inc, 2015, 649–657

[355] G. Zhou, T. He, J. Zhao, P. Hu, “Learning continuous word embedding with metadata for question retrieval in community question answering”, Proc. 53rd ACL and the 7th IJCNLP (Beijing, China), v. 1, Long Papers, ACL, 2015, 250–259

[356] H. Zhou, Y. Zhang, S. Huang, J. Chen, “A neural probabilistic structured-prediction model for transition-based dependency parsing”, Proc. 53rd ACL and the 7th IJCNLP (Beijing, China), v. 1, Long Papers, ACL, 2015, 1213–1222