Artificial neural networks and deep learning



February 20, 2015

1 Introduction

Artificial Neural Networks (ANNs) are a set of statistical modelling tools originally inspired by studies of biological neural networks in animals, for example the brain and the central nervous system. Concentrated research on artificial neural networks began in 1943, following work by McCulloch and Pitts, who showed the ability of neural networks to compute arithmetic and logical functions (Basheer and Hajmeer, 2000). ANNs have a wide variety of applications spanning a range of fields. Part of the reason for this is their flexibility: ANNs can be used to accomplish a wide variety of statistical tasks including classification, clustering, regression and forecasting. ANNs are universal function approximators, meaning that they can approximate any function to a desired degree of accuracy. The use of ANNs in function approximation has been considered extensively, particularly in microbiology (Basheer and Hajmeer, 2000).

A particularly important development in the study of ANNs was building invariance to various transformations of the observed data into the structure of an ANN itself (LeCun et al., 1998). This technique has been used extensively for pattern recognition tasks in machine learning, such as image analysis and speech recognition. More recently, with increases in computing power, ANNs of this type set a new benchmark in image classification, as they became the first algorithms to achieve human-competitive performance. For a long time ANNs had to be limited in size, due to difficulties in training. However, Hinton et al. (2006) proposed a procedure which could be used with much larger networks. This development caused an increased interest in ANN research, as modelling with larger networks allowed more sophisticated models to be built.

This report outlines key definitions in the study of ANNs, along with basic training procedures for the models. Particularly important techniques that have been introduced more recently, especially among the machine learning community, are then discussed.

2 Definition

We focus on a particular class of ANNs which has been most applied in the literature. They are known as feedforward artificial neural networks. A feedforward ANN consists of layers of units. There are generally one or more hidden layers and one output layer. Let us pick a particular unit located in the k-th hidden layer with input variables $z_1^{(k-1)}, \dots, z_D^{(k-1)}$. Then the input to this unit will consist of a linear combination of the input variables,

$$a^{(k)} = \mathbf{w}^{(k)\top} \mathbf{z}^{(k-1)} + w_0^{(k)}, \qquad (2.1)$$

where $\mathbf{w}^{(k)}$ is the weight vector for this unit, and $w_0^{(k)}$ is a constant known as the bias for this unit. The input $a^{(k)}$ is then transformed by a function $h(\cdot)$ to give the activation $z^{(k)}$ of the unit,

$$z^{(k)} = h(a^{(k)}). \qquad (2.2)$$

This activation is then passed as an input to the next layer of units. We define $\mathbf{z}^{(0)} := \mathbf{x}$, where $\mathbf{x}$ is an observed vector. Generally we choose $h(\cdot)$ to be sigmoidal.

Figure 1: Diagram of a single-hidden layer feedforward neural network (input x, hidden layer z^(1) with weights W^(1), output y with weights W^(2)).
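As a concrete illustration of (2.1) and (2.2), the following is a minimal NumPy sketch (my own illustration, not code from the report) of the forward pass of the single-hidden-layer network in Figure 1, using a logistic sigmoid for h; the array names and layer sizes are illustrative assumptions.

import numpy as np

def sigmoid(a):
    # Logistic sigmoid, a common sigmoidal choice for the activation function h(.)
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W1, b1, W2, b2):
    # Hidden layer: a^(1) = W^(1) x + b^(1), z^(1) = h(a^(1))   (equations 2.1 and 2.2)
    a1 = W1 @ x + b1
    z1 = sigmoid(a1)
    # Output layer: here taken to be linear in the hidden activations
    y = W2 @ z1 + b2
    return y, z1

# Illustrative sizes: 3 inputs, 4 hidden units, 1 output
rng = np.random.default_rng(0)
x = rng.normal(size=3)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)
y, _ = forward(x, W1, b1, W2, b2)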

The units in the final layer have the same structure as the hidden units defined in (2.1) and (2.2); however, their activations are passed as the outputs y of the network. Feedforward simply indicates that the activations of each unit are sent only to the next layer, never to a layer behind it. Figure 1 depicts a single-hidden layer feedforward network.

3 Parameter optimization

Due to the complexity of ANNs, the chosen error function E(W) (often derived from the likelihood function) can rarely be minimised analytically. This function is also rarely convex, so generally a good local minimum is found instead of the global minimum. For shallow networks, such minima can generally be found by running a numerical algorithm such as gradient descent multiple times from different starting points. For large datasets, however, these methods take a long time to perform each iteration. An alternative procedure, which performs better for large datasets, is known as stochastic gradient descent. This method does not use all the data at each iteration, instead using a sample of the data. In order to use these gradient-based methods, however, we require an efficient way to calculate $\nabla E(\mathbf{W})$ for a feedforward neural network. Two methods are outlined in the next two sections.

3.1 Error backpropagation

Error backpropagation provides a computationally efficient method for evaluating the derivatives of a broad class of error functions of an ANN with respect to the weights. Generally an error function E(W), such as the least squares error, comprises a sum of terms, one for each data point; call each term $E_n(\mathbf{W})$. Then, without loss of generality, we derive the evaluation of one term $E_n(\mathbf{W})$, as in Bishop (2006), and drop the subscript n to avoid cluttered notation.

Consider the derivative of E with respect to a weight $w_{ji}^{(k)}$, the weight connecting unit i in layer k-1 to unit j in layer k. Generally E depends on $w_{ji}^{(k)}$ only via $a_j^{(k)}$. Assuming this is true, the chain rule allows us to write

$$\frac{\partial E}{\partial w_{ji}^{(k)}} = \frac{\partial E}{\partial a_j^{(k)}} \frac{\partial a_j^{(k)}}{\partial w_{ji}^{(k)}} = \delta_j^{(k)} z_i^{(k-1)}, \qquad (3.1)$$

where $z_i^{(k-1)}$ is obtained using (2.1) and we have defined

$$\delta_j^{(k)} := \frac{\partial E}{\partial a_j^{(k)}}. \qquad (3.2)$$

By using the chain rule again we obtain

$$\delta_j^{(k-1)} = \frac{\partial E}{\partial a_j^{(k-1)}} = \sum_{l} \frac{\partial E}{\partial a_l^{(k)}} \frac{\partial a_l^{(k)}}{\partial a_j^{(k-1)}},$$

where the index l runs over all units to which unit j sends connections. Finally, by making use of equations (2.2), (2.1) and the definition of $\delta$, we obtain the backpropagation formula

$$\delta_j^{(k-1)} = h'\!\left(a_j^{(k-1)}\right) \sum_{l} w_{lj}^{(k)} \delta_l^{(k)}. \qquad (3.3)$$

This allows us to write down the error backpropagation procedure, given L - 1 hidden layers, as follows:

1. Find the activations of all units for a given input vector $\mathbf{x}_n$ using equations (2.1) and (2.2).

2. Evaluate $\delta^{(L)}$ for each output unit using (3.2).

3. Obtain $\delta^{(k)}$ for each hidden unit using (3.3).

4. Use (3.1) to evaluate the required derivatives.
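To make steps 1 to 4 concrete, here is a minimal NumPy sketch (my own illustration, not from the report) of one backpropagation pass for the single-hidden-layer network of Figure 1, assuming a squared error term E = 0.5||y - t||^2, linear output units and a sigmoid hidden activation; the function and array names are illustrative.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_single_example(x, t, W1, b1, W2, b2):
    # Step 1: forward pass (equations 2.1 and 2.2)
    a1 = W1 @ x + b1
    z1 = sigmoid(a1)
    y = W2 @ z1 + b2                      # linear output units

    # Step 2: deltas for the output units; for E = 0.5*||y - t||^2 with linear
    # outputs, delta^(2) = dE/da^(2) = y - t   (equation 3.2)
    delta2 = y - t

    # Step 3: deltas for the hidden units via the backpropagation formula (3.3),
    # using h'(a) = sigmoid(a) * (1 - sigmoid(a))
    delta1 = (z1 * (1.0 - z1)) * (W2.T @ delta2)

    # Step 4: gradients of E with respect to the weights and biases (equation 3.1)
    dW2 = np.outer(delta2, z1)
    db2 = delta2
    dW1 = np.outer(delta1, x)
    db1 = delta1
    return dW1, db1, dW2, db2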

Error backpropagation can be extended to find the Hessian of the error and associated approximations. This allows more efficient algorithms, such as quasi-Newton methods, to be used to minimise the error function.

3.2 Extreme learning machines

While error backpropagation was a major breakthrough in the study of neural networks, the algorithm was often quite slow, and would sometimes get stuck at poor local minima. However, for single-hidden layer feedforward ANNs a new algorithm was developed, as described in Huang et al. (2006). It is significantly faster than traditional error backpropagation algorithms and analytically determines the output weights. Suppose we have a single output function whose activation is linear, giving

$$y(\mathbf{x}; \mathbf{W}) = \mathbf{w}^{(2)\top} h(\mathbf{a}^{(1)}), \qquad (3.4)$$

where $h(\mathbf{a}^{(1)}) = \left(h(a_1^{(1)}), \dots, h(a_L^{(1)})\right)^\top$. It was shown by Huang et al. (2006) that the input weights $\mathbf{w}^{(1)}$ (including biases) can be randomly assigned and do not require tuning, provided the activation function is infinitely differentiable. This allows us to write down the following algorithm:

1. Randomly assign input weights $\mathbf{w}^{(1)}$.

2. Using these weights and the available training data, write down the hidden layer output matrix

$$\mathbf{H} = \begin{pmatrix} h(\mathbf{w}_1^{(1)\top}\mathbf{x}_1) & \cdots & h(\mathbf{w}_L^{(1)\top}\mathbf{x}_1) \\ \vdots & & \vdots \\ h(\mathbf{w}_1^{(1)\top}\mathbf{x}_D) & \cdots & h(\mathbf{w}_L^{(1)\top}\mathbf{x}_D) \end{pmatrix}.$$

3. An estimate $\hat{\boldsymbol{\beta}}$ of the vector of output weights $\mathbf{w}^{(2)}$ can then be found by solving $\mathbf{H}\hat{\boldsymbol{\beta}} = \mathbf{T}$, where $\mathbf{T}$ is the matrix of target vectors. This has the solution

$$\hat{\boldsymbol{\beta}} = \mathbf{H}^{\dagger}\mathbf{T}, \qquad (3.5)$$

where $\mathbf{H}^{\dagger}$ is the Moore-Penrose generalised inverse of the matrix $\mathbf{H}$.

The algorithm has been used for a variety of applications, such as image processing, fault detection and forecasting. Despite its advantages, its use is limited as it can only be applied to single-hidden-layer ANNs. While the algorithm is faster than backpropagation, the problem of finding good local minima with error backpropagation is generally hard only when there are many layers in the network. Therefore many procedures which require more than one hidden layer, such as the convolutional neural networks discussed in the next section, still use error backpropagation.
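The following is a brief NumPy sketch of the three extreme learning machine steps above (an illustration under assumed names and shapes, not code from Huang et al. (2006)): random input weights, the hidden layer output matrix H, and output weights obtained via the Moore-Penrose generalised inverse as in (3.5).

import numpy as np

def elm_train(X, T, n_hidden, rng=None):
    """X: (n_samples, n_features) inputs; T: (n_samples, n_outputs) targets."""
    rng = np.random.default_rng(0) if rng is None else rng
    # Step 1: randomly assign input weights and biases (they are not tuned)
    W1 = rng.normal(size=(n_hidden, X.shape[1]))
    b1 = rng.normal(size=n_hidden)
    # Step 2: hidden layer output matrix H, one row per training example
    H = 1.0 / (1.0 + np.exp(-(X @ W1.T + b1)))      # sigmoidal h
    # Step 3: output weights via the Moore-Penrose generalised inverse (3.5)
    beta = np.linalg.pinv(H) @ T
    return W1, b1, beta

def elm_predict(X, W1, b1, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W1.T + b1)))
    return H @ beta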

4 Deep learning

4.1 Convolutional neural networks

A major application of neural networks is image analysis for machine learning. However, in order for the network to pick up subtle differences, for instance in handwriting, the network needs to be considerably resistant to transformations of the input data. Technically, a neural network should eventually become invariant to certain types of transformation simply by being exposed to them in the training data. Therefore one technique is to augment the existing training data by applying random transformations to it. While this is a beneficial technique, in practice it requires a large neural network to be trained on a vast amount of data, which is computationally expensive.

A technique that has had a huge impact on machine learning is the convolutional neural network (LeCun et al., 1998). The procedure takes advantage of the high correlation that generally occurs between neighbouring pixels in images. This is done by arranging the first hidden layer units into a number of different grids, each of which is referred to as a feature map; this layer of the neural network is known as the convolutional layer. Each unit takes as input a small part of the image (for instance a 5 x 5 pixel grid); however, the units within each feature map are constrained to have the same weight parameters. This allows each feature map to pick up different patterns in the images, but since each unit in the feature map is constrained to have the same weights, the location of the pattern is less important.

In order to make the neural network further invariant to transformations of the data, the convolutional layer passes its outputs to a second hidden layer, known as the subsampling layer. This layer is again arranged into grids of units; however, each unit in the subsampling layer takes as input several units of the convolutional layer. This has the effect of reducing the resolution of the data. The final layer in such a network will typically be a fully connected layer, with no weight sharing constraints. In practice a convolutional neural network will have multiple convolutional and subsampling layers. Figure 2 illustrates the LeNet-5 architecture described in detail by LeCun et al. (1998).

The weight sharing among units in the feature maps means there is a significant reduction in the number of parameters that need to be trained. This allows relatively large convolutional neural networks to be trained using a simple modification of error backpropagation. However, the depth of convolutional ANNs, which tends to lead to better learning, is still considerably limited by computational power.

Figure 2: Illustration of the convolutional neural network (LeNet-5) developed by LeCun et al. (1998). [Source: LeCun et al. (1998)]
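As a rough illustration of the weight sharing and subsampling described above (my own sketch, not the LeNet-5 implementation), the following NumPy code computes one feature map by sliding a single shared 5 x 5 weight patch over an image and then reduces its resolution with 2 x 2 average subsampling; the image size, the tanh activation and the choice of averaging are assumptions.

import numpy as np

def feature_map(image, patch_weights, bias):
    """One convolutional feature map: every unit shares patch_weights and bias."""
    ph, pw = patch_weights.shape
    out_h = image.shape[0] - ph + 1
    out_w = image.shape[1] - pw + 1
    fmap = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i:i + ph, j:j + pw]      # small patch of the image
            fmap[i, j] = np.tanh(np.sum(window * patch_weights) + bias)
    return fmap

def subsample(fmap, size=2):
    """Average subsampling: each unit pools a size x size block of the feature map."""
    h = (fmap.shape[0] // size) * size
    w = (fmap.shape[1] // size) * size
    blocks = fmap[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.mean(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.normal(size=(28, 28))                            # e.g. a 28x28 grey-scale digit
fmap = feature_map(image, rng.normal(size=(5, 5)), bias=0.0) # 24x24 feature map
pooled = subsample(fmap)                                     # 12x12 subsampled map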

4.2 Energy-based models and deep belief networks

As understanding of biological neural networks improved, it became apparent that an important step for AI would be the ability to train ANNs with many hidden layers. Few deep networks could be trained efficiently with traditional gradient-based methods, though a notable exception was convolutional neural networks. In most deep networks the algorithm would often get stuck at saddle points or poor local minima, and the trained network would often perform worse than a network with fewer layers.

A turning point in the study of ANNs came when Hinton et al. (2006) introduced deep belief networks (DBNs), a type of generative neural network in which each unit is treated as a random variable, rather than being deterministic. They showed that such a network can initially be trained efficiently layer by layer in an unsupervised way to find good values for the network parameters. These parameters can then be used to initialise a slower algorithm for supervised learning.

In order to introduce deep belief networks we first briefly outline energy-based models, which can be used to define DBNs. Following the definition of Bengio (2009), energy-based models associate an energy value $E(\cdot)$ with each configuration of the variables of interest $\mathbf{x}$. The basic idea is that we would like plausible configurations of $\mathbf{x}$ to have low energy. In this sense we can define a probability distribution as

$$P(\mathbf{x}) = \frac{e^{-E(\mathbf{x})}}{Z}, \qquad (4.1)$$

where $Z = \sum_{\mathbf{x}} e^{-E(\mathbf{x})}$ is a normalising constant known as the partition function. We can therefore see that configurations with a lower energy have a higher probability of being realised. An extension of this is the introduction of hidden, or unobserved, variables $\mathbf{h}$. In this case the joint probability distribution can be written as in (4.1), but the energy now depends on $\mathbf{x}$ and $\mathbf{h}$. The marginal distribution of the observed variables $\mathbf{x}$ can then be written as

$$P(\mathbf{x}) = \frac{\sum_{\mathbf{h}} e^{-E(\mathbf{x},\mathbf{h})}}{Z} = \frac{e^{-F(\mathbf{x})}}{Z}, \qquad (4.2)$$

where $F(\mathbf{x}) = -\log \sum_{\mathbf{h}} e^{-E(\mathbf{x},\mathbf{h})}$ is known as the free energy. Given model parameters $\theta$, (4.2) allows us to write down the gradient of the log likelihood as

$$\frac{\partial}{\partial \theta} \log P(\mathbf{x}; \theta) = -\frac{\partial F(\mathbf{x}; \theta)}{\partial \theta} + \frac{1}{Z} \sum_{\tilde{\mathbf{x}}} e^{-F(\tilde{\mathbf{x}}; \theta)} \frac{\partial F(\tilde{\mathbf{x}}; \theta)}{\partial \theta} \qquad (4.3)$$

$$= -\frac{\partial F(\mathbf{x}; \theta)}{\partial \theta} + \mathbb{E}_P\!\left[ \frac{\partial F(\tilde{\mathbf{x}}; \theta)}{\partial \theta} \right], \qquad (4.4)$$

where $\mathbb{E}_P$ denotes the expectation under the marginal distribution P. Therefore an estimate of the gradient of the log likelihood can be found if we are able to sample from P and can compute the free energy tractably. Restricted Boltzmann Machines are generative neural networks that satisfy these properties, and form a key component of a DBN.

A Restricted Boltzmann Machine (RBM) is a type of Markov random field that has two layers of units, one visible, $\mathbf{x}$, and one hidden, $\mathbf{h}$. A graphical model of an RBM is given in Figure 3a. As shown by Figure 3a, a key feature of RBMs is that the visible units are independent of each other given the values of the hidden units, and vice versa.

Figure 3: Graphical models for (a) a Restricted Boltzmann Machine, with visible layer x connected to hidden layer h by weights W, and (b) a deep belief network, with layers x, h^(1) and h^(2) connected by weights W^(1) and W^(2); the top two layers form an RBM.

The energy of an RBM is given by

$$E(\mathbf{x}, \mathbf{h}) = -\mathbf{b}^\top \mathbf{x} - \mathbf{c}^\top \mathbf{h} - \mathbf{h}^\top \mathbf{W} \mathbf{x}, \qquad (4.5)$$

where $\theta = \{\mathbf{b}, \mathbf{c}, \mathbf{W}\}$ are the model parameters: $\mathbf{W}$ contains the interaction terms between the observed and hidden units, and $\mathbf{b}$ and $\mathbf{c}$ are the bias values for the observed and hidden layers respectively. The conditional independence of the units in the same layer allows us to factorise the conditional distributions as $P(\mathbf{h} \mid \mathbf{x}) = \prod_i P(h_i \mid \mathbf{x})$, so that we are able to obtain a tractable expression for $P(h_i \mid \mathbf{x})$. Similarly we can obtain expressions for $P(x_i \mid \mathbf{h})$. The form of the energy function allows the free energy to be computed efficiently as

$$F(\mathbf{x}) = -\mathbf{b}^\top \mathbf{x} - \sum_i \log \sum_{h_i} e^{h_i (c_i + \mathbf{w}_i^\top \mathbf{x})}, \qquad (4.6)$$

where $\mathbf{w}_i$ denotes the i-th row of $\mathbf{W}$.
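For binary visible and hidden units (an assumption; the report does not fix the unit type), the sums in (4.6) and the factorised conditionals have closed forms. The following NumPy sketch uses them to compute the free energy and to draw the conditional Gibbs samples needed to estimate the gradient in (4.4); names and sizes are illustrative.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def free_energy(x, b, c, W):
    # Equation (4.6) for binary h_i: sum over h_i in {0,1} gives 1 + exp(c_i + w_i.x)
    return -b @ x - np.sum(np.logaddexp(0.0, c + W @ x))

def sample_h_given_x(x, c, W, rng):
    # Factorised conditional: P(h_i = 1 | x) = sigmoid(c_i + w_i.x) for binary units
    p = sigmoid(c + W @ x)
    return (rng.random(p.shape) < p).astype(float)

def sample_x_given_h(h, b, W, rng):
    # By symmetry, P(x_j = 1 | h) = sigmoid(b_j + h.W[:, j]) for binary units
    p = sigmoid(b + W.T @ h)
    return (rng.random(p.shape) < p).astype(float)

# One Gibbs step starting from a training example x0 (the step used in the
# contrastive divergence procedure described below): x0 -> h0 -> x1
rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 3
W = rng.normal(scale=0.1, size=(n_hidden, n_visible))
b, c = np.zeros(n_visible), np.zeros(n_hidden)
x0 = rng.integers(0, 2, size=n_visible).astype(float)
h0 = sample_h_given_x(x0, c, W, rng)
x1 = sample_x_given_h(h0, b, W, rng)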

The tractable expressions for $P(h_i \mid \mathbf{x})$ and $P(x_i \mid \mathbf{h})$ allow us to sample from the distribution P using a Gibbs sampler, initialised with a sample of the training data. Using this sample, the observed data and (4.6), we are able to obtain an approximation of the gradient of the log likelihood. Used in conjunction with a gradient descent algorithm, this allows us to approximately maximise the log likelihood and so obtain approximate maximum likelihood estimates $\hat{\theta}$. This procedure is known as contrastive divergence; a more detailed description of the algorithm can be found in Bengio (2009).

A graphical model of a deep belief network (DBN) is given in Figure 3b. As can be seen from the graphical model, given a DBN with $\ell$ layers, the units in the first $\ell - 2$ layers are independent given the values of the units in the layers above. The joint distribution of the top two layers of a DBN is the same as that of an RBM. This allows the joint distribution to be written as

$$P(\mathbf{x}, \mathbf{h}^{(1)}, \dots, \mathbf{h}^{(\ell)}) = P(\mathbf{h}^{(\ell-1)}, \mathbf{h}^{(\ell)}) \left( \prod_{k=1}^{\ell-2} P(\mathbf{h}^{(k)} \mid \mathbf{h}^{(k+1)}) \right) P(\mathbf{x} \mid \mathbf{h}^{(1)}), \qquad (4.7)$$

where $P(\mathbf{h}^{(\ell-1)}, \mathbf{h}^{(\ell)})$ has the same form as a Restricted Boltzmann Machine.

Contrastive divergence for RBMs allows us to write down an algorithm for unsupervised training of deep belief networks by training each layer as a Restricted Boltzmann Machine:

1. Using contrastive divergence, find estimates for the first layer parameters $\theta^{(1)}$.

2. Fix the first layer parameters $\theta^{(1)}$ and, using samples from $\mathbf{h}^{(1)}$, find estimates for the second layer parameters $\theta^{(2)}$ using contrastive divergence.

3. Recursively repeat step 2 for the rest of the layers in the network.

After training the weights of the DBN in an unsupervised way as in the algorithm above, the weights can be used to initialise a supervised algorithm which uses labelled training data.

Since the development of DBNs by Hinton et al. (2006), a variety of other deep architectures have been introduced, including convolutional DBNs, and the field of deep learning has gained significant industrial interest. More recently, however, due to a significant increase in processing power, algorithms using traditional convolutional ANNs with many layers and error backpropagation have become more feasible. GPU-based implementations of traditional convolutional ANNs have been found to still produce the best results for certain pattern recognition tasks such as image analysis and speech recognition. A key criticism of DBNs is that the theoretical properties of the methods they introduced, for example contrastive divergence, are not well understood.

References

Basheer, I. and Hajmeer, M. (2000). Artificial neural networks: fundamentals, computing, design, and application. Journal of Microbiological Methods, 43(1):3-31.

Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1-127.

Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.

Hinton, G., Osindero, S., and Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527-1554.

Huang, G.-B., Zhu, Q.-Y., and Siew, C.-K. (2006). Extreme learning machine: theory and applications. Neurocomputing, 70(1):489-501.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324.