Thought Leadership Paper: Predictive Analytics

An Introduction to Deep Learning: Examining the Advantages of Hierarchical Learning
Table of Contents

The Emergence of Deep Learning
Applying Deep-Learning Techniques
Scaling Deep-Learning Algorithms
Conclusion

Dr. Ying Wu, PhD, is a data scientist within the Advanced Analytics organization at SAP. With more than 10 years of research experience in artificial intelligence, Dr. Wu focuses mainly on designing and applying a wide range of machine learning techniques in data mining, as well as providing solutions for integrating predictive analytics into innovations from SAP. Before joining SAP, Dr. Wu served as a researcher at University College Cork (UCC), Ireland, for six years. During his tenure at UCC, he researched primarily artificial intelligence in data integration and data mining and was involved in projects funded by the European Union Framework Program (FP7), the European Space Agency, the Irish Environmental Protection Agency, and the Geological Survey of Ireland. Dr. Wu has published more than 15 research papers in the area of artificial intelligence. He received a master's degree with distinction in information technology from the University of Paisley and a PhD in artificial intelligence and data mining from the University of the West of Scotland, UK.

Dr. Rouzbeh Razavi, PhD, is a data scientist within the Advanced Analytics R&D organization at SAP. In his current role, Dr. Razavi is responsible for providing expertise in areas related to machine learning, data mining practices and design, and the implementation of advanced algorithms. Prior to joining SAP, he served as a senior scientist at Bell Laboratories, Alcatel-Lucent, for over five years. At Bell Labs, Dr. Razavi introduced a number of innovations with significant business and scientific impact, and he is the recipient of several awards, including the prestigious Bell Labs Golden Pen Award. Before joining Bell Labs, he was a research fellow at the University of Essex for three years. Dr. Razavi has published more than 70 technical papers, filed more than 25 patent applications, and authored five books and book chapters. He received both a master's degree in information systems and a PhD in computer science from the University of Essex, UK. He also holds a second master's degree in business analytics from University College Dublin.
Deep learning is taking the academic community and business world by storm. This machine learning approach is powering the latest generation of commodity computing and deriving significant value from Big Data. Most important of all, it is radically changing how computers recognize speech, identify objects in images, and recall and process information: three of the fundamental building blocks of artificial intelligence.
The Emergence of Deep Learning

For over a decade, computer science has been changing nearly every aspect of our lives. Once dismissed as an unrealistic dream, artificial intelligence (AI) has finally come to fruition, enabling computers to understand and interact with us while carrying out their own reasoning.

Over the years, much research has been done on AI methods. Machine learning, for example, applies the concept of artificial neural networks (ANNs), a family of statistical learning algorithms inspired by biological neural networks and, by extension, the inner workings of the human brain. This approach is used to estimate or approximate functions that depend on a large number of inputs and that are generally unknown. ANNs are commonly represented as systems of interconnected neurons that compute values from inputs and, thanks to their adaptive nature, are capable of machine learning and pattern recognition.

Despite their popularity and wide variety of applications, neural network models and other machine learning methods typically contain a shallow architecture of two or three levels. Researchers reported positive results on a wide range of applications with two or three layers; however, training deeper networks yielded less-promising results. In addition, they found that multilayer neural networks with more than two hidden layers had only a marginal impact on performance while requiring a significant increase in training time. Why? Many believe that the earlier hidden layers in a multilayer architecture sit too far away from the output. When learning through backpropagation, where the source of the learning signal is the output, such layers are therefore not very effective and remain heavily influenced by their initial random settings. Yoshua Bengio and Yann LeCun observed that, in most classical machine learning methods with a large number of parameters to consider, optimal learning can only be achieved when some form of prior knowledge is available.
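The shallow setup described above, a single hidden layer trained with backpropagation, can be sketched in a few lines of Python. This is a toy illustration (the XOR task, layer sizes, and learning rate are our own choices, not drawn from any cited work) showing how the error signal originates at the output and flows backward:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy task: XOR, a function a single-layer perceptron cannot represent.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# One hidden layer of 8 units: the classical "shallow" architecture.
W1 = rng.normal(0, 1, (2, 8)); b1 = np.zeros(8)
W2 = rng.normal(0, 1, (8, 1)); b2 = np.zeros(1)

losses = []
for _ in range(5000):
    # Forward pass through the single hidden layer.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    losses.append(float(np.mean((out - y) ** 2)))
    # Backpropagation: the error signal starts at the output and is
    # propagated backward. Layers far from the output receive only a
    # weak signal, which is the difficulty noted above for deep stacks.
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= h.T @ d_out; b2 -= d_out.sum(0)
    W1 -= X.T @ d_h;   b1 -= d_h.sum(0)

print(round(losses[0], 3), round(losses[-1], 3))  # loss should shrink
```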
Moreover, when a problem is characterized by complex behaviors, highly varying mathematical functions are usually required to solve it. These functions are highly nonlinear in the data space and can display a very large number of variations. With a shallow architecture, learning algorithms for such highly varying functions are greatly affected by the number of dimensions in the problem and are very prone to suboptimal performance.

In recent years, deep architectures, motivated by biological and circuit complexity theories, have been reported to be more efficient than shallow architectures, especially when the problem is assumed to involve complex behaviors with highly varying mathematical functions. These deep-learning networks are usually built with multiple hidden layers. A hidden layer is where the network stores its internal abstract representation of the training data. In deep learning, the hidden layers are trained in an entirely different fashion from traditional neural networks. More specifically, each layer in a deep network is pretrained with an unsupervised learning algorithm, producing a nonlinear transformation of its input (the output of the previous layer) that captures more abstract features. Then, in the final training phase, the deep architecture is fine-tuned against a supervised training criterion with gradient-based optimization.

1. Bengio, Y., and LeCun, Y., in Large Scale Kernel Machines, MIT Press, 2007.
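The greedy layer-wise pretraining procedure just described can be sketched as follows. This is a minimal, hypothetical illustration using tied-weight autoencoders in NumPy; the data, layer sizes, and training schedule are invented for demonstration, and a real system would follow this with supervised fine-tuning of the whole stack:

```python
import numpy as np

rng = np.random.default_rng(1)

def train_autoencoder(data, n_hidden, lr=0.01, epochs=200):
    """Train W so that data is approximately reconstructed as
    tanh(data @ W) @ W.T (a tied-weight autoencoder)."""
    n_in = data.shape[1]
    W = rng.normal(0, 0.1, (n_in, n_hidden))
    for _ in range(epochs):
        h = np.tanh(data @ W)            # encode: nonlinear transformation
        recon = h @ W.T                  # decode with tied weights
        err = recon - data
        # Gradient of the reconstruction error w.r.t. W (tied-weight form).
        dW = data.T @ ((err @ W) * (1 - h ** 2)) + err.T @ h
        W -= lr * dW / len(data)
    return W

X = rng.normal(0, 1, (256, 32))          # toy unlabeled data
layer_sizes = [16, 8]                     # two stacked hidden layers

weights, inp = [], X
for n_hidden in layer_sizes:
    W = train_autoencoder(inp, n_hidden)  # unsupervised, one layer at a time
    weights.append(W)
    inp = np.tanh(inp @ W)                # feed the abstraction upward

# `weights` now initializes a deep network; supervised, gradient-based
# fine-tuning over all layers would follow.
print([w.shape for w in weights])  # [(32, 16), (16, 8)]
```

Each layer sees only the representation produced by the layer below it, which is how lower-level abstractions are composed into higher-level ones.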
The concept of deep learning is designed to train higher-level features by composing lower-level features. As Bengio and LeCun proposed, deep-learning networks can automatically discover abstractions, from lower-level features to higher-level concepts, through a series of processing stages. 2 Lower-level abstractions are more tightly related to individual pieces of data, while higher-level abstractions are more directly tied to actual, meaningful concepts. One advantage of such a deep architecture is that each level of abstraction focuses on a small subset of a large number of features. The information to be learned is not located in a single layer of neurons but is distributed across multiple layers. Such a distributed representation gives deep-learning networks a stronger capacity for learning and can produce much better generalization than traditional machine-learning methods. Furthermore, an architecture with multiple levels based on a distributed representation of data allows deep-learning networks to learn intermediate representations, which can be shared across different problem areas. This means that knowledge learned as intermediate representations is reusable: new high-level features can be learned by combining lower-level intermediate features from a common pool of information.

A large body of literature has focused on deep-learning methods. Almost 10 years ago, Geoffrey E. Hinton and his team presented the concept of deep belief networks (DBNs). 3 In 2007 deep neural networks based on autoencoders were proposed in a study conducted by Bengio and his team. 4 Not all deep-learning methods were derived after 2006, however. Another neural network model with a deep architecture, the convolutional neural network (CNN), was introduced by LeCun in 1998, though much research has been done since 2006 to extend the CNN framework.
For instance, the CNN has been applied to restricted Boltzmann machines (RBMs) and DBNs. 5 Conversely, the unsupervised pretraining step of deep learning has also been applied to the CNN. 6

2. Ibid.
3. Hinton, G., and Salakhutdinov, R., "Reducing the Dimensionality of Data with Neural Networks," Science, 2006.
4. Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H., "Greedy Layer-Wise Training of Deep Networks," Advances in Neural Information Processing Systems, 2007.
5. Lee, H., Grosse, R., Ranganath, R., and Ng, A. Y., "Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations," ICML, 2009.
6. Kavukcuoglu, K., Ranzato, M. A., Fergus, R., and LeCun, Y., "Learning Invariant Features Through Topographic Filter Maps," CVPR, 2009.
In summary, Bengio and LeCun have listed several advantages of a deep architecture, such as the ability to:
- Learn complex, highly varying functions
- Analyze low-level, intermediate, and high-level abstractions with little human input
- Process a very large set of examples
- Learn from mostly unlabeled data
- Exploit synergies present across a large number of tasks 7

In terms of popularity, deep architecture has gained significant attention in recent years. See Figure 1 for an illustration of the evolution and popularity of different machine learning algorithms, including the emerging deep-learning methods, over the years.

Figure 1: Evolution and popularity of machine learning algorithms from 1960 to the present day, covering the perceptron, neural networks, decision trees (ID3), support vector machines (SVMs), AdaBoost, random forests, large-scale perceptrons, and deep learning

7. Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H., "Greedy Layer-Wise Training of Deep Networks," Advances in Neural Information Processing Systems, 2007.
Applying Deep-Learning Techniques

Since 2006, deep architectures have been enabling state-of-the-art performance. With this success, the technology has been applied across a wide range of fields such as classification, dimensionality reduction, robotics, image recognition, image retrieval, information retrieval, language modeling, and natural language processing. The DBN and stacked autoencoder were originally demonstrated with success on the Mixed National Institute of Standards and Technology (MNIST) data set as an image-recognition task. 8 Recently, some image classification models based on deep architectures have reported state-of-the-art performance on this data set. In a study conducted by Dan Ciresan, Ueli Meier, and Jurgen Schmidhuber, a convolutional neural network was built and trained that achieved a very low 0.23% error rate. 9 In this section, we provide a brief overview of applications where deep learning has been successfully applied. For more information on this topic, we strongly encourage you to read Deep Learning: Methods and Applications (Foundations and Trends in Signal Processing) by Li Deng and Dong Yu. 10

MULTIMEDIA SIGNAL PROCESSING

Traditionally, multimedia signal processing has been an active area for machine learning algorithms, including tasks related to image recognition, classification, and retrieval. The application of deep learning to object-recognition tasks can be traced back to CNNs in the early 1990s. However, the introduction of deep learning has resulted in a paradigm shift in object recognition and classification. The fundamental principle of deep learning is the ability to autonomously generate high-level representations from raw data sources, which makes it a natural complement to image recognition and classification.
In other words, the raw data is fed into the first layer, higher-level features are extracted and passed to the next layer, and so on, until the eventual output (such as a prediction) is produced. In a study where deep architectures were used along with convolutional structures for computer vision and image recognition, it was reported that the deep CNN approach achieved a considerably lower error rate than any other state-of-the-art machine learning method used to date. 11 This work was trained on a data set containing 1,000 unique image classes as targets, with 1.2 million high-resolution images in the training set and 150,000 images in the test set.

Machine learning has also been successfully applied to speech and audio signal processing. In this context, deep learning makes it possible to work directly from primitive spectral and waveform features, whereas such features were traditionally handcrafted. Experimental results validate the superiority of deep-learning methods for speech recognition, especially in the presence of noise. Features such as Siri by Apple, Google Now, Microsoft Cortana, and Baidu Deep Speech are examples of commercial products relying on such advancements in speech processing.

8. MNIST data set, http://yann.lecun.com/exdb/mnist.
9. Ciresan, D., Meier, U., and Schmidhuber, J., "Multi-Column Deep Neural Networks for Image Classification," arXiv preprint, 2012.
10. Deng, L., and Yu, D., Deep Learning: Methods and Applications (Foundations and Trends in Signal Processing), Now Publishers Inc., 2014.
11. Krizhevsky, A., Sutskever, I., and Hinton, G., "ImageNet Classification with Deep Convolutional Neural Networks," Advances in Neural Information Processing Systems, pages 1-9, 2012.
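The layer-by-layer feature extraction described in this section can be illustrated with the two core CNN operations, convolution and pooling. The image, kernel, and sizes below are toy values of our own choosing, not the ImageNet configuration:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid (no-padding) 2D convolution of a single-channel image."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Downsample a feature map by taking the max over size x size tiles."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

image = np.arange(36, dtype=float).reshape(6, 6)  # raw input "pixels"
kernel = np.array([[-1.0, 1.0]])                  # horizontal-gradient filter

fmap = np.maximum(conv2d(image, kernel), 0.0)     # convolution + ReLU
pooled = max_pool(fmap)                           # pooled, higher-level map
print(pooled.shape)  # (3, 2)
```

Stacking several such convolution, nonlinearity, and pooling stages, followed by fully connected layers, yields the kind of deep CNN discussed above.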
SEARCH ENGINES AND INFORMATION RETRIEVAL

In information retrieval, a user submits queries to a system containing a large collection of documents and objects, with the goal of obtaining a set of relevant results. Web search engines are commonly held to be the most important category of information retrieval service. Traditionally, search engines retrieve Web-based documents by matching terms in documents with those in a search query, a process called lexical matching. However, lexical matching can be suboptimal due to the language discrepancy between documents and queries. Many practitioners are looking at semantic matching as an approach to close this gap. Latent semantic analysis (LSA) and its extensions are mature concepts that were introduced 25 years ago. A new trend has now started, however, which deploys deep neural networks to extract high-level semantic representations. This explains why search engine giants such as Google, Microsoft, Yahoo, and Baidu are investing significantly in this area.

As for image documents, content-based image retrieval performs searches according to the visual content of the images themselves. Deep-learning techniques have been widely applied in this area in recent years. Ji Wan and colleagues proposed a deep-learning framework as shown in Figure 2. 12 The model was successfully trained on the ILSVRC-2012 data set from ImageNet and delivered state-of-the-art performance with 1,000 categories and more than 1 million training images.
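A minimal sketch of the classical LSA approach mentioned above may help make "semantic matching" concrete. The vocabulary and term-document counts below are invented for illustration:

```python
import numpy as np

# Term-document matrix: rows are terms, columns are documents.
# doc0 and doc1 are about vehicles; doc2 is about plants.
terms = ["car", "auto", "engine", "flower", "petal"]
A = np.array([[2, 0, 0],     # car
              [0, 2, 0],     # auto
              [1, 1, 0],     # engine
              [0, 0, 2],     # flower
              [0, 0, 1]],    # petal
             dtype=float)

# Factor the matrix and keep the top-k latent "concepts".
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T   # documents in concept space

def query_vec(word):
    # Project a single-word query into the same latent space.
    q = np.array([1.0 if t == word else 0.0 for t in terms])
    return q @ U[:, :k]

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

sims = [cosine(query_vec("car"), d) for d in doc_vecs]
```

The query "car" never occurs in doc1, so lexical matching would score it zero; yet LSA assigns doc1 a high similarity because "car" co-occurs with "auto" and "engine" in the latent concept space, while the plant document scores near zero. The deep-network approaches described above replace this linear factorization with learned nonlinear semantic representations.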
Figure 2: A deep-learning framework for content-based image retrieval (CBIR). A convolutional neural network (input raw RGB image, convolutional filtering, local contrast normalization and pooling, then fully connected layers FC1-FC3) is trained on a massive source image collection in various categories (for example, ImageNet ILSVRC-2012) and then applied to new image retrieval data sets, producing low-level, midlevel, and high-level feature representations. Three schemes are compared: (I) directly use the feature representations from layers FC1, FC2, and FC3 of the ImageNet-trained model; (II) adopt a metric learning technique to refine the feature representation obtained in Scheme I; and (III) retrain a deep CNN model with a classification or similarity loss function, initialized by the ImageNet-trained model.

12. Wan, J., et al., "Deep Learning for Content-Based Image Retrieval: A Comprehensive Study," Proceedings of the ACM Multimedia Conference (MM2014), 2014.
LANGUAGE MODELING AND NATURAL LANGUAGE PROCESSING

Research in language modeling and processing has recently gained significant popularity. The goal of a language model is to estimate the distribution of natural language as accurately as possible. Natural language processing (NLP), or computational linguistics, also deals with word sequences, but its tasks are much more diverse. Deep learning has been shown to be very promising for both language modeling and NLP. For example, the long short-term memory (LSTM) architecture has been applied in machine translation. 13, 14 The WMT '14 English-to-French data set was used to evaluate this architecture, and the models were trained on a subset of 12 million sentences consisting of 348 million French words and 304 million English words. The deep LSTM architecture was built with four layers, 1,000 cells at each layer, and 1,000-dimensional word embeddings. The input vocabulary consisted of 160,000 words, and the output vocabulary of 80,000 words. The study determined that deep LSTMs significantly outperform shallow LSTMs, with each additional layer reducing perplexity by nearly 10%. 15

13. Sutskever, I., Vinyals, O., and Le, Q. V., "Sequence to Sequence Learning with Neural Networks," e-print arXiv:1409.3215, 2014.
14. Hochreiter, S., and Schmidhuber, J., "Long Short-Term Memory," Neural Computation, 1997.
15. Sutskever, I., Vinyals, O., and Le, Q. V., "Sequence to Sequence Learning with Neural Networks," e-print arXiv:1409.3215, 2014.
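A single forward step of the LSTM cell underlying such translation models can be sketched as follows. The dimensions and random weights below are toy values of our own choosing, not the four-layer, 1,000-cell configuration of the study:

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_in, n_hid = 4, 3
# One weight set per gate: input (i), forget (f), output (o), candidate (g).
Wx = {g: rng.normal(0, 0.1, (n_in, n_hid)) for g in "ifog"}
Wh = {g: rng.normal(0, 0.1, (n_hid, n_hid)) for g in "ifog"}
b = {g: np.zeros(n_hid) for g in "ifog"}

def lstm_step(x, h, c):
    i = sigmoid(x @ Wx["i"] + h @ Wh["i"] + b["i"])   # input gate
    f = sigmoid(x @ Wx["f"] + h @ Wh["f"] + b["f"])   # forget gate
    o = sigmoid(x @ Wx["o"] + h @ Wh["o"] + b["o"])   # output gate
    g = np.tanh(x @ Wx["g"] + h @ Wh["g"] + b["g"])   # candidate content
    c_new = f * c + i * g          # memory cell mixes old and new content
    h_new = o * np.tanh(c_new)     # hidden state passed to the next layer
    return h_new, c_new

# Run a short "sentence" of 5 random word vectors through the cell.
h, c = np.zeros(n_hid), np.zeros(n_hid)
for _ in range(5):
    h, c = lstm_step(rng.normal(0, 1, n_in), h, c)
print(h.shape)  # (3,)
```

The gated memory cell is what lets the architecture carry information across long word sequences; a deep LSTM simply stacks several such layers, feeding each layer's hidden state into the one above.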
Scaling Deep-Learning Algorithms

While deep learning brings impressive advantages in many applications, training deep-learning models is not straightforward when the volume of data is very large, because the iterative computations inherent in most deep-learning methods are difficult to parallelize. In recent years, there has been much research into effective and scalable parallel algorithms for training deep-learning models.

For instance, DistBelief is a software framework designed for distributed training and learning in deep networks with very large models and large-scale data sets. 16 It uses large clusters of machines to manage data and model parallelism through multithreading, message passing, synchronization, and communication between machines. The large network architecture is first partitioned into smaller parallel structures, whose nodes are assigned to and calculated on several machines. As shown in Figure 3, the model is partitioned into four blocks, each assigned to a single machine; only the boundary nodes require data transfers between the machines.

Figure 3: A model partitioned into four blocks and assigned to four machines 17

DistBelief was evaluated on two deep-learning models:
1. A fully connected network with 42 million model parameters and 1.1 billion examples
2. A locally connected convolutional neural network with 16 million images of 100x100 pixels, 21,000 categories, and as many as 1.7 billion parameters

The experimental results show that locally connected learning models benefit more from DistBelief, with the method running 12 times faster than a single machine.

An alternative way to train deep-learning models is by using GPUs. In August 2013, single-precision NVIDIA GPUs exceeded 4.5 teraFLOP/s with a memory bandwidth of nearly 300 GB/s. This makes GPUs particularly well suited for massively parallel computing, with more transistors devoted to data processing needs. 18

16. Chen, X.-W., and Lin, X., "Big Data Deep Learning: Challenges and Perspectives," IEEE Access, 2014.
17. Dean, J., et al., "Large-Scale Distributed Deep Networks," Advances in Neural Information Processing Systems, 2012.
18. Chen, X.-W., and Lin, X., "Big Data Deep Learning: Challenges and Perspectives," IEEE Access, 2014.
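The data-parallel side of DistBelief-style training, in which each machine computes a gradient on its own shard of the data and a parameter server averages the results into one update, can be simulated on a toy problem. This sketch is our own simplification (linear regression, synchronous averaging, simulated "machines"), not the DistBelief implementation:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy learning problem: recover w_true from noisy linear observations.
w_true = np.array([2.0, -1.0])
X = rng.normal(0, 1, (1000, 2))
y = X @ w_true + 0.01 * rng.normal(0, 1, 1000)

# Data parallelism: split the training data across four "machines".
n_machines = 4
shards = list(zip(np.array_split(X, n_machines), np.array_split(y, n_machines)))

def local_gradient(w, Xs, ys):
    # Gradient of the mean squared error on one machine's shard.
    return 2.0 * Xs.T @ (Xs @ w - ys) / len(ys)

w = np.zeros(2)
for _ in range(200):
    # Each machine computes its gradient independently (in parallel
    # on a real cluster); the parameter server averages them.
    grads = [local_gradient(w, Xs, ys) for Xs, ys in shards]
    w -= 0.1 * np.mean(grads, axis=0)

print(np.round(w, 2))  # close to w_true = [2, -1]
```

Model parallelism, the other half of the DistBelief design shown in Figure 3, would instead split the parameters of a single network across machines, with communication only at the block boundaries.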
A couple of years ago, Adam Coates, Brody Huval, Tao Wang, David J. Wu, and Andrew Y. Ng proposed the commodity off-the-shelf high-performance computing (COTS HPC) system for training deep network models with more than 11 billion free parameters using just three machines. 19 The COTS HPC system comprises a cluster of 16 GPU servers with InfiniBand adapters for interconnects and MPI for data exchange within the cluster. Each server is equipped with four NVIDIA GTX 680 GPUs, each having 4 GB of memory. Refer to Figure 4 for a summary of research efforts dedicated to large-scale deep-learning experimentation.

Figure 4: Recent research progress in large-scale deep learning 20

Method | Computing power | Size of models | Average running time
DBN | NVIDIA GTX 280 GPU with 1 GB of memory | 1 million images; 100 million parameters | 1 day
CNN | 2 GTX 580 GPUs with 3 GB of memory | 1.2 million images (256x256); 60 million parameters | 5-6 days
DistBelief | 1,000 CPUs, Downpour SGD with Adagrad | 1.1 billion audio examples; 42 million parameters | 16 hours
Sparse autoencoder | 1,000 CPUs (16,000 cores) | 10 million images (200x200); 1 billion parameters | 3 days
COTS HPC | 64 NVIDIA GTX 680 GPUs, each with 4 GB of memory | 10 million images (200x200); 11 billion parameters | 3 days

In addition, it is worth mentioning Deeplearning4j, the first commercial-grade, open-source, distributed deep-learning library written for Java and Scala. 21 Integrated with Hadoop and Spark, Deeplearning4j is designed for business environments and includes both a distributed multithreaded deep-learning framework and a single-threaded one. Training takes place in the cluster, which means it can process massive amounts of data. Networks are trained in parallel through iterative reduce, and they are equally compatible with Java, Scala, and Clojure.

19. Coates, A., Huval, B., Wang, T., Wu, D., and Ng, A., "Deep Learning with COTS HPC Systems," Journal of Machine Learning Research, 2013. 20.
Chen, X.-W., and Lin, X., "Big Data Deep Learning: Challenges and Perspectives," IEEE Access, 2014. 21. Deeplearning4j, http://deeplearning4j.org, 2015.
Conclusion

In recent years, deep-learning techniques have received much attention in the academic community as well as in global industries. Deep learning allows a distributed representation of data, in which abstract features are automatically captured and compactly represented across the hidden layers of the network. As a result, a system with a deep architecture can show a strong learning capacity and open the door to a rich form of generalization, even if the problem being solved involves complex behaviors and highly varying mathematical functions. Deep-learning techniques can be applied across a wide range of domains and deliver state-of-the-art performance. At the same time, deep learning is not a stand-alone research area; rather, it is closely related to Big Data techniques. When the application is large scale and involves a huge volume of training data, the deep-learning architecture is usually complex and requires high-performance computation.

For an overview and concise discussion of bringing deep-learning algorithms into SAP Predictive Analytics software and using them for modeling, please refer to the thought leadership paper "Embed Deep-Learning Techniques into Predictive Modeling."

Studio SAP 38002enUS (15/05)
No part of this publication may be reproduced or transmitted in any form or for any purpose without the express permission of SAP SE or an SAP affiliate company. SAP and other SAP products and services mentioned herein as well as their respective logos are trademarks or registered trademarks of SAP SE (or an SAP affiliate company) in Germany and other countries. Please see http://www.sap.com/corporate-en/legal/copyright/index.epx#trademark for additional trademark information and notices. Some software products marketed by SAP SE and its distributors contain proprietary software components of other software vendors. National product specifications may vary. These materials are provided by SAP SE or an SAP affiliate company for informational purposes only, without representation or warranty of any kind, and SAP SE or its affiliated companies shall not be liable for errors or omissions with respect to the materials. The only warranties for SAP SE or SAP affiliate company products and services are those that are set forth in the express warranty statements accompanying such products and services, if any. Nothing herein should be construed as constituting an additional warranty. In particular, SAP SE or its affiliated companies have no obligation to pursue any course of business outlined in this document or any related presentation, or to develop or release any functionality mentioned therein. This document, or any related presentation, and SAP SE's or its affiliated companies' strategy and possible future developments, products, and/or platform directions and functionality are all subject to change and may be changed by SAP SE or its affiliated companies at any time for any reason without notice. The information in this document is not a commitment, promise, or legal obligation to deliver any material, code, or functionality.
All forward-looking statements are subject to various risks and uncertainties that could cause actual results to differ materially from expectations. Readers are cautioned not to place undue reliance on these forward-looking statements, which speak only as of their dates, and they should not be relied upon in making purchasing decisions.