An overview of methods for deep learning in neural networks
Vestnik Ûžno-Uralʹskogo gosudarstvennogo universiteta. Seriâ Vyčislitelʹnaâ matematika i informatika, Volume 6 (2017) no. 3, pp. 28-59. This article was harvested from the source Math-Net.Ru


At present, deep learning is becoming one of the most popular approaches to building artificial intelligence systems for tasks such as speech recognition, natural language processing, and computer vision. The paper presents a historical overview of deep learning in neural networks. The model of an artificial neural network is described, along with learning algorithms for neural networks, including the error backpropagation algorithm used to train deep neural networks. The development of neural network architectures is presented, including the neocognitron, autoencoders, convolutional neural networks, restricted Boltzmann machines, deep belief networks, long short-term memory, gated recurrent neural networks, and residual networks. Training deep neural networks with many hidden layers is impeded by the vanishing gradient problem. The paper describes approaches to this problem that make it possible to train neural networks with more than a hundred layers. An overview of popular deep learning libraries is also given. Nowadays, convolutional neural networks are used for computer vision tasks, while for sequence processing, including natural language processing, recurrent networks are the preferred solution, primarily long short-term memory networks and gated recurrent neural networks.
Keywords: deep learning, neural networks, machine learning.
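The abstract names the error backpropagation algorithm as the basic method for training deep neural networks. As an informal illustration (not taken from the surveyed paper), the following is a minimal NumPy sketch of backpropagation for a small fully connected network on a toy XOR task; the layer sizes, activations, learning rate, and data are assumptions made for the example only.

# Minimal illustrative sketch of error backpropagation (NumPy only).
# All concrete choices below (toy XOR data, 2 -> 8 -> 1 architecture,
# ReLU/sigmoid activations, learning rate) are assumptions for the
# example and are not taken from the surveyed paper.
import numpy as np

rng = np.random.default_rng(0)

# Toy XOR data: 4 samples, 2 features, 1 binary target.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

# Weights of a two-layer network: 2 -> 8 (ReLU) -> 1 (sigmoid).
W1 = rng.normal(0.0, 1.0, (2, 8)); b1 = np.zeros((1, 8))
W2 = rng.normal(0.0, 1.0, (8, 1)); b2 = np.zeros((1, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.5
for epoch in range(5000):
    # Forward pass.
    z1 = X @ W1 + b1
    a1 = np.maximum(z1, 0.0)             # ReLU hidden layer
    z2 = a1 @ W2 + b2
    a2 = sigmoid(z2)                      # output in (0, 1)

    loss = np.mean((a2 - y) ** 2)         # mean squared error

    # Backward pass: propagate the error layer by layer (chain rule).
    d_a2 = 2.0 * (a2 - y) / len(X)
    d_z2 = d_a2 * a2 * (1.0 - a2)         # derivative of sigmoid
    d_W2 = a1.T @ d_z2
    d_b2 = d_z2.sum(axis=0, keepdims=True)

    d_a1 = d_z2 @ W2.T
    d_z1 = d_a1 * (z1 > 0.0)              # derivative of ReLU
    d_W1 = X.T @ d_z1
    d_b1 = d_z1.sum(axis=0, keepdims=True)

    # Gradient descent update.
    W2 -= lr * d_W2; b2 -= lr * d_b2
    W1 -= lr * d_W1; b1 -= lr * d_b1

print("final loss:", round(float(loss), 4))
print("predictions:", a2.ravel().round(2))

The same forward/backward pattern scales to deeper networks; with many layers, the repeated multiplication by activation derivatives is what gives rise to the vanishing gradient problem discussed in the paper, which ReLU activations and residual connections help to mitigate.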
@article{VYURV_2017_6_3_a2,
     author = {A. V. Sozykin},
     title = {An overview of methods for deep learning in neural networks},
     journal = {Vestnik \^U\v{z}no-Uralʹskogo gosudarstvennogo universiteta. Seri\^a Vy\v{c}islitelʹna\^a matematika i informatika},
     pages = {28--59},
     year = {2017},
     volume = {6},
     number = {3},
     language = {ru},
     url = {http://geodesic.mathdoc.fr/item/VYURV_2017_6_3_a2/}
}
TY  - JOUR
AU  - A. V. Sozykin
TI  - An overview of methods for deep learning in neural networks
JO  - Vestnik Ûžno-Uralʹskogo gosudarstvennogo universiteta. Seriâ Vyčislitelʹnaâ matematika i informatika
PY  - 2017
SP  - 28
EP  - 59
VL  - 6
IS  - 3
UR  - http://geodesic.mathdoc.fr/item/VYURV_2017_6_3_a2/
LA  - ru
ID  - VYURV_2017_6_3_a2
ER  - 
%0 Journal Article
%A A. V. Sozykin
%T An overview of methods for deep learning in neural networks
%J Vestnik Ûžno-Uralʹskogo gosudarstvennogo universiteta. Seriâ Vyčislitelʹnaâ matematika i informatika
%D 2017
%P 28-59
%V 6
%N 3
%U http://geodesic.mathdoc.fr/item/VYURV_2017_6_3_a2/
%G ru
%F VYURV_2017_6_3_a2
A. V. Sozykin. An overview of methods for deep learning in neural networks. Vestnik Ûžno-Uralʹskogo gosudarstvennogo universiteta. Seriâ Vyčislitelʹnaâ matematika i informatika, Volume 6 (2017) no. 3, pp. 28-59. http://geodesic.mathdoc.fr/item/VYURV_2017_6_3_a2/

[1] Y. LeCun, Y. Bengio, G. Hinton, “Deep Learning”, Nature, 521 (2015), 436–444 | DOI

[2] D. Ravi, C. Wong, F. Deligianni, et al., “Deep Learning for Health Informatics”, IEEE Journal of Biomedical and Health Informatics, 21:1 (2017), 4–21 | DOI

[3] J. Schmidhuber, “Deep Learning in Neural Networks: an Overview”, Neural Networks, 61 (2015), 85–117 | DOI

[4] W. S. McCulloch, W. Pitts, “A Logical Calculus of the Ideas Immanent in Nervous Activity”, The Bulletin of Mathematical Biophysics, 5:4 (1943), 115–133 | DOI

[5] G. Hinton, R. Salakhutdinov, “Reducing the Dimensionality of Data with Neural Networks”, Science, 313:5786 (2006), 504–507 | DOI

[6] G. E. Hinton, S. Osindero, Y. W. Teh, “A Fast Learning Algorithm for Deep Belief Nets”, Neural Computation, 18:7 (2006), 1527–1554 | DOI

[7] J. Sima, “Loading Deep Networks Is Hard”, Neural Computation, 6:5 (1994), 842–850 | DOI

[8] D. Windisch, “Loading Deep Networks Is Hard: The Pyramidal Case”, Neural Computation, 17:2 (2005), 487–502 | DOI

[9] F. J. Gomez, J. Schmidhuber, “Co-Evolving Recurrent Neurons Learn Deep Memory POMDPs”, Proc. of the 2005 Conference on Genetic and Evolutionary Computation (GECCO) (Washington, DC, USA, June 25–29, 2005), 2005, 491–498 | DOI

[10] D. C. Ciresan, U. Meier, L. M. Gambardella, J. Schmidhuber, “Deep, Big, Simple Neural Nets for Handwritten Digit Recognition”, Neural Computation, 22:12 (2010), 3207–3220 | DOI

[11] K. He, X. Zhang, S. Ren, et al., “Deep Residual Learning for Image Recognition”, 2016 IEEE Conference on Computer Vision and Pattern Recognition (Las Vegas, NV, USA, 27–30 June 2016), 2016, 770–778 | DOI

[12] D. E. Rumelhart, G. E. Hinton, J. L. McClelland, “A General Framework for Parallel Distributed Processing”, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, 1 (1986), 45–76 | DOI

[13] Y. LeCun, L. Bottou, G. B. Orr, “Efficient BackProp”, Neural Networks: Tricks of the Trade, 1998, 9–50 | DOI

[14] D. S. Broomhead, D. Lowe, “Multivariable Functional Interpolation and Adaptive Networks”, Complex Systems, 2 (1988), 321–355 | DOI

[15] M. H. Stone, “The Generalized Weierstrass Approximation Theorem”, Mathematics Magazine, 21:4 (1948), 167–184 | DOI

[16] A. N. Gorban, V. L. Dunin-Barkovsky, A. N. Kirdin, et al., Neuroinformatics, Nauka, Novosibirsk, 1998, 296 pp.

[17] K. Hornik, M. Stinchcombe, H. White, “Multilayer Feedforward Networks are Universal Approximators”, Neural Networks, 2:5 (1989), 359–366 | DOI

[18] H. N. Mhaskar, C. A. Micchelli, “Approximation by Superposition of Sigmoidal and Radial Basis Functions”, Advances in Applied Mathematics, 13:3 (1992), 350–373 | DOI

[19] D. O. Hebb, The Organization of Behavior, Wiley, New York, 1949, 335 pp. | DOI

[20] A. B. Novikoff, “On Convergence Proofs on Perceptrons”, Symposium on the Mathematical Theory of Automata, 12 (1962), 615–622

[21] F. Rosenblatt, “The Perceptron: a Probabilistic Model for Information Storage and Organization in the Brain”, Psychological Review, 65:6 (1958), 386–408 | DOI

[22] B. Widrow, M. Hoff, “Associative Storage and Retrieval of Digital Information in Networks of Adaptive Neurons”, Biological Prototypes and Synthetic Systems, 1 (1962), 160 pp. | DOI

[23] K. S. Narendra, M. A. L. Thathachar, “Learning Automata – a Survey”, IEEE Transactions on Systems, Man, and Cybernetics, 4 (1974), 323–334 | DOI

[24] F. Rosenblatt, Principles of Neurodynamics. Perceptrons and the Theory of Brain Mechanisms, Spartan Books, Washington, 1962, 616 pp. | DOI

[25] S. Grossberg, “Some Networks That Can Learn, Remember, and Reproduce any Number of Complicated Space-Time Patterns”, International Journal of Mathematics and Mechanics, 19 (1969), 53–91 | DOI

[26] T. Kohonen, “Correlation Matrix Memories”, IEEE Transactions on Computers, 100:4 (1972), 353–359 | DOI

[27] C. von der Malsburg, “Self-Organization of Orientation Sensitive Cells in the Striate Cortex”, Kybernetik, 14:2 (1973), 85–100 | DOI

[28] D. J. Willshaw, C. von der Malsburg, “How Patterned Neural Connections Can Be Set Up by Self Organization”, Proceedings of the Royal Society London, 194 (1976), 431–445 | DOI

[29] A. G. Ivakhnenko, “Heuristic Self-Organization in Problems of Engineering Cybernetics”, Automatica, 6:2 (1970), 207–219 | DOI

[30] A. G. Ivakhnenko, “Polynomial Theory of Complex Systems”, IEEE Transactions on Systems, Man and Cybernetics, 4 (1971), 364–378 | DOI

[31] S. Ikeda, M. Ochiai, Y. Sawaragi, “Sequential GMDH Algorithm and Its Application to River Flow Prediction”, IEEE Transactions on Systems, Man and Cybernetics, 7 (1976), 473–479 | DOI

[32] M. Witczak, J. Korbicz, M. Mrugalski, et al., “A GMDH Neural Network-Based Approach to Robust Fault Diagnosis: Application to the DAMADICS Benchmark Problem”, Control Engineering Practice, 14:6 (2006), 671–683 | DOI

[33] T. Kondo, J. Ueno, “Multi-Layered GMDH-type Neural Network Self-Selecting Optimum Neural Network Architecture and Its Application to 3-Dimensional Medical Image Recognition of Blood Vessels”, International Journal of Innovative Computing, Information and Control, 4:1 (2008), 175–187

[34] S. Linnainmaa, The Representation of the Cumulative Rounding Error of an Algorithm as a Taylor Expansion of the Local Rounding Errors, University of Helsinki, 1970

[35] S. Linnainmaa, “Taylor Expansion of the Accumulated Rounding Error”, BIT Numerical Mathematics, 16:2 (1976), 146–160 | DOI

[36] P. J. Werbos, “Applications of Advances in Nonlinear Sensitivity Analysis”, Lecture Notes in Control and Information Sciences, 38 (1981), 762–770 | DOI

[37] D. B. Parker, Learning Logic. Technical Report TR-47, Center for Computational Research in Economics and Management Science, Massachusetts Institute of Technology, Cambridge, MA, 1985

[38] Y. LeCun, “A Theoretical Framework for Back-Propagation”, Proceedings of the 1988 Connectionist Models Summer School (Pittsburgh, Pennsylvania, USA, June 17–26, 1988), 1988, 21–28

[39] D. E. Rumelhart, G. E. Hinton, R. J. Williams, “Learning Internal Representations by Error Propagation”, Parallel Distributed Processing, 1 (1986), 318–362 | DOI

[40] N. Qian, “On the Momentum Term in Gradient Descent Learning Algorithms”, Neural Networks: The Official Journal of the International Neural Network Society, 12:1 (1999), 145–151 | DOI

[41] J. Duchi, E. Hazan, Y. Singer, “Adaptive Subgradient Methods for Online Learning and Stochastic Optimization”, Journal of Machine Learning Research, 12 (2011), 2121–2159

[42] D. P. Kingma, J. L. Ba, “Adam: a Method for Stochastic Optimization”, International Conference on Learning Representations (San Diego, USA, May 7–9, 2015), 2015, 1–13

[43] K. Fukushima, “Neocognitron: a Self-Organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position”, Biological Cybernetics, 36:4 (1980), 193–202 | DOI

[44] D. H. Hubel, T. N. Wiesel, “Receptive Fields of Single Neurones in the Cat’s Striate Cortex”, The Journal of Physiology, 148:3 (1959), 574–591 | DOI

[45] K. Fukushima, “Artificial Vision by Multi-Layered Neural Networks: Neocognitron and its Advances”, Neural Networks, 37 (2013), 103–119 | DOI

[46] K. Fukushima, “Training Multi-Layered Neural Network Neocognitron”, Neural Networks, 40 (2013), 18–31 | DOI

[47] K. Fukushima, “Increasing Robustness Against Background Noise: Visual Pattern Recognition by a Neocognitron”, Neural Networks, 24:7 (2011), 767–778 | DOI

[48] D. H. Ballard, “Modular Learning in Neural Networks”, Proceedings of the Sixth National Conference on Artificial Intelligence (July 13–17, Seattle, Washington, USA), 1987, 279–284

[49] G. E. Hinton, J. L. McClelland, “Learning Representations by Recirculation”, Neural Information Processing Systems, 1988, 358–366

[50] D. H. Wolpert, “Stacked Generalization”, Neural Networks, 5:2 (1992), 241–259 | DOI

[51] K. M. Ting, I. H. Witten, “Stacked Generalization: When Does It Work?”, Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI) (Nagoya, Japan, August 23–29, 1997), 1997, 866–871

[52] Y. LeCun, B. Boser, J. S. Denker, et al., “Back-Propagation Applied to Handwritten Zip Code Recognition”, Neural Computation, 1:4 (1989), 541–551 | DOI

[53] Y. LeCun, B. Boser, J. S. Denker, et al., “Handwritten Digit Recognition with a Back-Propagation Network”, Advances in Neural Information Processing Systems 2. Morgan Kaufmann, 1990, 396–404

[54] P. Baldi, Y. Chauvin, “Neural Networks for Fingerprint Recognition”, Neural Computation, 5:3 (1993), 402–418 | DOI

[55] J. L. Elman, “Finding Structure in Time”, Cognitive Science, 14:2 (1990), 179–211 | DOI

[56] M. I. Jordan, Serial Order: a Parallel Distributed Processing Approach, Institute for Cognitive Science, University of California, San Diego. ICS Report 8604, 1986, 40 pp.

[57] M. I. Jordan, “Serial Order: a Parallel Distributed Processing Approach”, Advances in Psychology, 121 (1997), 471–495 | DOI

[58] S. Hochreiter, Untersuchungen zu Dynamischen Neuronalen Netzen. Diploma thesis, Institut für Informatik, Lehrstuhl Prof. Brauer, Technische Universität München, 1991

[59] S. Hochreiter, Y. Bengio, P. Frasconi, et al., “Gradient Flow in Recurrent Nets: the Difficulty of Learning Long-Term Dependencies”, A Field Guide to Dynamical Recurrent Neural Networks, 2001, 237–243, Wiley-IEEE Press | DOI

[60] Y. Bengio, P. Simard, P. Frasconi, “Learning Long-Term Dependencies with Gradient Descent is Difficult”, IEEE Transactions on Neural Networks, 5:2 (1994), 157–166 | DOI

[61] P. Tino, B. Hammer, “Architectural Bias in Recurrent Neural Networks: Fractal Analysis”, Neural Computation, 15:8 (2004), 1931–1957 | DOI

[62] S. Hochreiter, J. Schmidhuber, “Bridging Long Time Lags by Weight Guessing and ‘Long Short-Term Memory’”, Spatiotemporal Models in Biological and Artificial Systems, 37 (1996), 65–72

[63] J. Schmidhuber, D. Wierstra, M. Gagliolo, et al., “Training Recurrent Networks by Evolino”, Neural Computation, 19:3 (2007), 757–779 | DOI

[64] L. A. Levin, “Universal Sequential Search Problems”, Problems of Information Transmission, 9:3 (1973), 265–266

[65] J. Schmidhuber, “Discovering Neural Nets with Low Kolmogorov Complexity and High Generalization Capability”, Neural Networks, 10:5 (1997), 857–873 | DOI

[66] M. F. Moller, Exact Calculation of the Product of the Hessian Matrix of Feed-Forward Network Error Functions and a Vector in O(N) Time, No. PB-432, Computer Science Department, Aarhus University, Denmark, 1993 | DOI

[67] B. A. Pearlmutter, “Fast Exact Multiplication by the Hessian”, Neural Computation, 6:1 (1994), 147–160 | DOI

[68] N. N. Schraudolph, “Fast Curvature Matrix-Vector Products for Second-Order Gradient Descent”, Neural Computation, 14:7 (2002), 1723–1738 | DOI

[69] J. Martens, “Deep Learning via Hessian-Free Optimization”, Proceedings of the 27th International Conference on Machine Learning (ICML-10) (Haifa, Israel, June 21–24, 2010), 2010, 735–742

[70] J. Martens, I. Sutskever, “Learning Recurrent Neural Networks with Hessian-Free Optimization”, Proceedings of the 28th International Conference on Machine Learning (ICML-11) (Bellevue, Washington, USA, June 28–July 02, 2011), 2011, 1033–1040

[71] J. Schmidhuber, “Learning Complex, Extended Sequences Using the Principle of History Compression”, Neural Computation, 4:2 (1992), 234–242 | DOI

[72] J. Connor, D. R. Martin, L. E. Atlas, “Recurrent Neural Networks and Robust Time Series Prediction”, IEEE Transactions on Neural Networks, 5:2 (1994), 240–254 | DOI

[73] G. Dorffner, “Neural Networks for Time Series Processing”, Neural Network World, 6 (1996), 447–468

[74] J. Schmidhuber, M. C. Mozer, D. Prelinger, “Continuous History Compression”, Proceedings of International Workshop on Neural Networks (Aachen, Germany, 1993), 1993, 87–95

[75] S. Hochreiter, J. Schmidhuber, “Long Short-Term Memory”, Neural Computation, 9:8 (1997), 1735–1780 | DOI

[76] F. A. Gers, J. Schmidhuber, F. Cummins, “Learning to Forget: Continual Prediction with LSTM”, Neural Computation, 12:10 (2000), 2451–2471 | DOI

[77] J. A. Perez-Ortiz, F. A. Gers, D. Eck, et al., “Kalman Filters Improve LSTM Network Performance in Problems Unsolvable by Traditional Recurrent Nets”, Neural Networks, 16:2 (2003), 241–250 | DOI

[78] J. Weng, N. Ahuja, T. S. Huang, “Cresceptron: a Self-Organizing Neural Network Which Grows Adaptively”, International Joint Conference on Neural Networks (IJCNN) (Baltimore, USA, 7–11 June 1992), 1992, 576–581 | DOI

[79] J. J. Weng, N. Ahuja, T. S. Huang, “Learning Recognition and Segmentation Using the Cresceptron”, International Journal of Computer Vision, 25:2 (1997), 109–143 | DOI

[80] M. A. Ranzato, F. J. Huang, Y. L. Boureau, et al., “Unsupervised Learning of Invariant Feature Hierarchies with Applications to Object Recognition”, IEEE Conference on Computer Vision and Pattern Recognition (Minneapolis, MN, USA, 17–22 June 2007), 2007, 1–8 | DOI

[81] D. Scherer, A. Muller, S. Behnke, “Evaluation of Pooling Operations in Convolutional Architectures for Object Recognition”, Lecture Notes in Computer Science, 6354 (2010), 92–101 | DOI

[82] P. Smolensky, “Information Processing in Dynamical Systems: Foundations of Harmony Theory”, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, 1 (1986), 194–281

[83] G. E. Hinton, T. E. Sejnowski, “Learning and Relearning in Boltzmann Machines”, Parallel Distributed Processing, 1 (1986), 282–317

[84] R. Memisevic, G. E. Hinton, “Learning to Represent Spatial Transformations with Factored Higher-Order Boltzmann Machines”, Neural Computation, 22:6 (2010), 1473–1492 | DOI

[85] A. Mohamed, G. E. Hinton, “Phone Recognition Using Restricted Boltzmann Machines”, IEEE International Conference on Acoustics, Speech and Signal Processing (Dallas, TX, USA, 14–19 March 2010), 2010, 4354–4357 | DOI

[86] R. Salakhutdinov, G. Hinton, “Semantic Hashing”, International Journal of Approximate Reasoning, 50:7 (2009), 969–978 | DOI

[87] Y. Bengio, P. Lamblin, D. Popovici, et al., “Greedy Layer-Wise Training of Deep Networks”, Advances in Neural Information Processing Systems 19, 2007, 153–160

[88] P. Vincent, L. Hugo, Y. Bengio, et al., “Extracting and Composing Robust Features with Denoising Autoencoders”, Proceedings of the 25th international Conference on Machine learning (Helsinki, Finland, July 05–09, 2008), 2008, 1096–1103. | DOI

[89] D. Erhan, Y. Bengio, A. Courville, et al., “Why Does Unsupervised Pre-Training Help Deep Learning?”, Journal of Machine Learning Research, 11 (2010), 625–660

[90] I. Arel, D. C. Rose, T. P. Karnowski, “Deep Machine Learning – a New Frontier in Artificial Intelligence Research”, Computational Intelligence Magazine, IEEE, 5:4 (2010), 13–18 | DOI

[91] V. Jain, H. S. Seung, “Natural Image Denoising with Convolutional Networks”, Advances in Neural Information Processing Systems (NIPS) 21, 2009, 769–776

[92] A. Sh. Razavian, H. Azizpour, J. Sullivan, “CNN Features Off-the-Shelf: An Astounding Baseline for Recognition”, Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops (Washington, DC, USA, June 23–28, 2014), 2014, 512–519 | DOI

[93] W. Ruochen, X. Zhe, “A Pedestrian and Vehicle Rapid Identification Model Based on Convolutional Neural Network”, Proceedings of the 7th International Conference on Internet Multimedia Computing and Service (ICIMCS 15) (Zhangjiajie, China, August 19–21, 2015), 2015, 1–4 | DOI

[94] L. Boominathan, S. S. Kruthiventi, R. V. Babu, “CrowdNet: A Deep Convolutional Network for Dense Crowd Counting”, Proceedings of the 2016 ACM on Multimedia Conference (Amsterdam, The Netherlands, October 15–19, 2016), 2016, 640–644 | DOI

[95] A. Kinnikar, M. Husain, S. M. Meena, “Face Recognition Using Gabor Filter And Convolutional Neural Network”, Proceedings of the International Conference on Informatics and Analytics (Pondicherry, India, August 25–26, 2016), 2016 | DOI

[96] R. H. Hahnloser, R. Sarpeshkar, M. A. Mahowald, et al., “Digital Selection and Analogue Amplification Coexist in a Cortex-Inspired Silicon Circuit”, Nature, 405 (2000), 947–951 | DOI

[97] R. H. Hahnloser, H. S. Seung, J. J. Slotine, “Permitted and Forbidden Sets in Symmetric Threshold-Linear Networks”, Neural Computation, 15:3 (2003), 621–638 | DOI

[98] X. Glorot, A. Bordes, Y. Bengio, “Deep Sparse Rectifier Neural Networks”, Journal of Machine Learning Research, 15 (2011), 315–323

[99] X. Glorot, Y. Bengio, “Understanding the Difficulty of Training Deep Feedforward Neural Networks”, Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS’10) (Sardinia, Italy, May 13–15, 2010), Society for Artificial Intelligence and Statistics, 2010, 249–256

[100] K. He, X. Zhang, S. Ren, et al., “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification”, Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV) (Santiago, Chile, December 7–13, 2015), 2015, 1026–1034 | DOI

[101] S. Ioffe, C. Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”, JMLR Workshop and Conference Proceedings. Proceedings of the 32nd International Conference on Machine Learning (Lille, France, July 06–11, 2015), 2015, 448–456

[102] C. Szegedy, W. Liu, Y. Jia, et al., “Going Deeper with Convolutions”, IEEE Conference on Computer Vision and Pattern Recognition (Boston, MA, USA, June 7–12, 2015), 2015, 1–9 | DOI

[103] C. Szegedy, V. Vanhoucke, S. Ioffe, et al., “Rethinking the Inception Architecture for Computer Vision”, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (Seattle, WA, USA, Jun 27–30, 2016), 2016, 2818–2826 | DOI

[104] C. Szegedy, S. Ioffe, V. Vanhoucke, et al., “Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning”, Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17) (San Francisco, California, USA, February 4–9, 2017), 2017, 4278–4284

[105] K. Cho, B. van Merrienboer, C. Gulcehre, et al., “Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation”, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (Doha, Qatar, October 25–29, 2014), 2014, 1724–1734 | DOI

[106] K. Cho, B. van Merrienboer, D. Bahdanau, et al., “On the Properties of Neural Machine Translation: Encoder–Decoder Approaches”, Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation (Doha, Qatar, October 25, 2014), 2014, 103–111 | DOI

[107] J. Chung, C. Gulcehre, K. Cho, et al., “Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling”, NIPS 2014 Workshop on Deep Learning (Montreal, Canada, December 12, 2014), 2014, 1–9

[108] K. He, J. Sun, “Convolutional Neural Networks at Constrained Time Cost”, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (Boston, MA, USA, June 07–12, 2015), 2015, 5353–5360 | DOI

[109] Y. Jia, E. Shelhamer, J. Donahue, et al., “Caffe: Convolutional Architecture for Fast Feature Embedding”, Proceedings of the 22nd ACM International Conference on Multimedia (Orlando, FL, USA, November 03–07, 2014), 2014, 675–678 | DOI

[110] D. Kruchinin, E. Dolotov, K. Kornyakov, et al., “Comparison of Deep Learning Libraries on the Problem of Handwritten Digit Classification”, Analysis of Images, Social Networks and Texts. Communications in Computer and Information Science, 542 (2015), 399–411 | DOI

[111] S. Bahrampour, N. Ramakrishnan, L. Schott, et al., Comparative Study of Deep Learning Software Frameworks, https://arxiv.org/abs/1511.06435

[112] J. Bergstra, O. Breuleux, F. Bastien, et al., “Theano: a CPU and GPU Math Expression Compiler”, Proceedings of the Python for Scientific Computing Conference (SciPy) (Austin, TX, USA, June 28–July 3, 2010), 2010, 3–10

[113] M. Abadi, A. Agarwal, P. Barham, “TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems”, Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16) (Savannah, GA, USA, November, 2–4, 2016), 2016, 265–283

[114] R. Collobert, K. Kavukcuoglu, C. Farabet, “Torch7: a Matlab-like Environment for Machine Learning”, BigLearn, NIPS Workshop (Granada, Spain, December 12–17, 2011), 2011

[115] F. Seide, A. Agarwal, “CNTK: Microsoft’s Open-Source Deep-Learning Toolkit”, Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 16) (San Francisco, California, USA, August 13–17, 2016), 2016, 2135–2135 | DOI

[116] A. Viebke, S. Pllana, “The Potential of the Intel Xeon Phi for Supervised Deep Learning”, IEEE 17th International Conference on High Performance Computing and Communications (HPCC) (New York, USA, August 24–26, 2015), 2015, 758–765 | DOI

[117] F. Chollet, Keras, 2015, https://github.com/fchollet/keras

[118] PaddlePaddle: PArallel Distributed Deep LEarning, http://www.paddlepaddle.org/

[119] T. Chen, M. Li, Y. Li, MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems, https://arxiv.org/abs/1512.01274

[120] Intel Nervana Reference Deep Learning Framework Committed to Best Performance on all Hardware, https://www.intelnervana.com/neon/

[121] S. Shi, Q. Wang, P. Xu, Benchmarking State-of-the-Art Deep Learning Software Tools, https://arxiv.org/abs/1608.07249

[122] K. Weiss, T. M. Khoshgoftaar, D. Wang, “A Survey of Transfer Learning”, Journal of Big Data, 3:1 (2016), 1–9 | DOI

[123] J. Ba, V. Mnih, K. Kavukcuoglu, “Multiple Object Recognition with Visual Attention”, Proceedings of the International Conference on Learning Representations (ICLR) (San Diego, USA, May 7–9, 2015), 2015, 1–10

[124] A. Graves, A. R. Mohamed, G. Hinton, “Speech Recognition with Deep Recurrent Neural Networks”, IEEE International Conference on Acoustics, Speech and Signal Processing (Vancouver, Canada, May 26–31, 2013), 2013, 6645–6649 | DOI