Factoring Variations in Natural Images with Deep Gaussian Mixture Models

Aäron van den Oord, Benjamin Schrauwen
Electronics and Information Systems department (ELIS), Ghent University
{aaron.vandenoord, benjamin.schrauwen}@ugent.be

Abstract

Generative models can be seen as the swiss army knives of machine learning, as many problems can be written probabilistically in terms of the distribution of the data, including prediction, reconstruction, imputation and simulation. One of the most promising directions for unsupervised learning may lie in Deep Learning methods, given their success in supervised learning. However, one of the current problems with deep unsupervised learning methods is that they are often harder to scale. As a result there are some easier, more scalable shallow methods, such as the Gaussian Mixture Model and the Student-t Mixture Model, that remain surprisingly competitive. In this paper we propose a new scalable deep generative model for images, called the Deep Gaussian Mixture Model, that is a straightforward but powerful generalization of GMMs to multiple layers. The parametrization of a Deep GMM allows it to efficiently capture products of variations in natural images. We propose a new EM-based algorithm that scales well to large datasets, and we show that both the Expectation and the Maximization steps can easily be distributed over multiple machines. In our density estimation experiments we show that deeper GMM architectures generalize better than more shallow ones, with results in the same ballpark as the state of the art.

1 Introduction

There has been an increasing interest in generative models for unsupervised learning, with many applications in image processing [1, 2], natural language processing [3, 4], vision [5] and audio [6]. Generative models can be seen as the swiss army knives of machine learning, as many problems can be written probabilistically in terms of the distribution of the data, including prediction, reconstruction, imputation and simulation.
One of the most promising directions for unsupervised learning may lie in Deep Learning methods, given their recent results in supervised learning [7]. Although not a universal recipe for success, the merits of deep learning are well-established [8]. Because of their multilayered nature, these methods provide ways to efficiently represent increasingly complex relationships as the number of layers increases. Shallow methods will often require a very large number of units to represent the same functions, and may therefore overfit more.

Looking at real-valued data, one of the current problems with deep unsupervised learning methods is that they are often hard to scale to large datasets. This is especially a problem for unsupervised learning, because there is usually a lot of data available, as it does not have to be labeled (e.g. images, videos, text). As a result there are some easier, more scalable shallow methods, such as the Gaussian Mixture Model (GMM) and the Student-t Mixture Model (STM), that remain surprisingly competitive [2]. Of course, the disadvantage of these mixture models is that they have less representational power than deep models.

In this paper we propose a new scalable deep generative model for images, called the Deep Gaussian Mixture Model (Deep GMM). The Deep GMM is a straightforward but powerful generalization of Gaussian Mixture Models to multiple layers. It is constructed by stacking multiple GMM-layers on top of each other, which is similar to many other Deep Learning techniques. Although for every deep GMM, one could construct a shallow GMM with the same density function, it would require an exponential number of mixture components to do so.

Figure 1: Visualizations of (a) a Gaussian, (b) a GMM and (c) a Deep GMM distribution. Note that these are not graphical models. This visualization describes the connectivity of the linear transformations that make up the multimodal structure of a deep GMM. The sampling process for the deep GMM is shown in red. Every time a sample is drawn, it is first drawn from a standard normal distribution and then transformed with all the transformations on a randomly sampled path. In the example it is first transformed with $A_{1,3}$, then with $A_{2,1}$ and finally with $A_{3,2}$. Every path results in differently correlated normal random variables. The deep GMM shown has $3 \cdot 2 \cdot 3 = 18$ possible paths. For each square transformation matrix $A_{i,j}$ there is a corresponding bias term $b_{i,j}$ (not shown here).

The multilayer architecture of the Deep GMM gives rise to a specific kind of parameter tying. The parameterization is most interpretable in the case of images: the layers in the architecture are able to efficiently factorize the different variations that are present in natural images: changes in brightness, contrast, color and even translations or rotations of the objects in the image. Because each of these variations will affect the image separately, a traditional mixture model would need an exponential number of components to model each combination of variations, whereas a Deep GMM can factor these variations and model them individually.
The proposed training algorithm for the Deep GMM is based on the most popular principle for training GMMs: Expectation Maximization (EM). Although stochastic gradient descent (SGD) is also a possible option, we suggest the use of EM, as it is inherently more parallelizable. As we will show later, both the Expectation and the Maximization steps can easily be distributed on multiple computation units or machines, with only limited communication between compute nodes. Although there has been a lot of effort in scaling up SGD for deep networks [9], the Deep GMM is parallelizable by design.

The remainder of this paper is organized as follows. We start by introducing the design of deep GMMs before explaining the EM algorithm for training them. Next, we discuss the experiments where we examine the density estimation performance of the deep GMM, as a function of the number of layers, and in comparison with other methods. We conclude in Section 5, where we also discuss some unsolved problems for future work.

2 Stacking Gaussian Mixture layers

Deep GMMs are best introduced by looking at some special cases: the multivariate normal distribution and the Gaussian Mixture model. One way to define a multivariate normal variable $x$ is as a standard normal variable $z \sim \mathcal{N}(0, I_n)$ that has been transformed with a certain linear transformation: $x = Az + b$, so that $p(x) = \mathcal{N}(x \mid b, AA^T)$.
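As a brief check of this identity, the mean and covariance of $x = Az + b$ follow from standard properties of affine maps of Gaussian variables. The derivation below is spelled out for convenience and assumes only that $z$ has zero mean and identity covariance:

```latex
\begin{align*}
\mathbb{E}[x] &= A\,\mathbb{E}[z] + b = b, \\
\operatorname{Cov}[x] &= \mathbb{E}\!\left[(x-b)(x-b)^T\right]
  = A\,\mathbb{E}\!\left[zz^T\right]A^T = A\,I_n\,A^T = AA^T .
\end{align*}
```

Since an affine transformation of a Gaussian variable is again Gaussian, these two moments fully determine the distribution of $x$. When $A$ is square and invertible, the same density can be written with the change of variables $z = A^{-1}(x - b)$ as $p(x) = |A^{-1}|\,\mathcal{N}(A^{-1}(x - b) \mid 0, I_n)$, where $|\cdot|$ is the absolute value of the determinant.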

Figure 1(a) visualizes this construction. The same interpretation can be applied to Gaussian Mixture Models, see Figure 1(b). A transformation is chosen from a set of (square) transformations $A_i$, $i = 1 \ldots N$ (each having a bias term $b_i$) with probabilities $\pi_i$, $i = 1 \ldots N$, such that the resulting distribution becomes:

$$p(x) = \sum_{i=1}^{N} \pi_i \, \mathcal{N}(x \mid b_i, A_i A_i^T).$$

With this in mind, it is easy to generalize GMMs in a multi-layered fashion. Instead of sampling one transformation from a set, we can sample a path of transformations in a network of $k$ layers, see Figure 1(c). The standard normal variable $z$ is now successively transformed with a transformation from each layer of the network. Let $\Phi$ be the set of all possible paths through the network. Each path $p = (p_1, p_2, \ldots, p_k) \in \Phi$ has a probability $\pi_p$ of being sampled, with

$$\sum_{p \in \Phi} \pi_p = \sum_{p_1, p_2, \ldots, p_k} \pi_{(p_1, p_2, \ldots, p_k)} = 1.$$

Here $N_j$ is the number of components in layer $j$, so that $p_j \in \{1, \ldots, N_j\}$. The density function of $x$ is:

$$p(x) = \sum_{p \in \Phi} \pi_p \, \mathcal{N}(x \mid \mu_p, \Omega_p \Omega_p^T), \tag{1}$$

with

$$\mu_p = b_{k,p_k} + A_{k,p_k}\big(\ldots\big(b_{2,p_2} + A_{2,p_2} b_{1,p_1}\big)\big), \tag{2}$$

$$\Omega_p = \prod_{j=k}^{1} A_{j,p_j}. \tag{3}$$

Here $A_{m,n}$ and $b_{m,n}$ are the $n$-th transformation matrix and bias of the $m$-th layer. Notice that one can also factorize $\pi_p$ as follows: $\pi_{(p_1, p_2, \ldots, p_k)} = \pi_{p_1} \pi_{p_2} \cdots \pi_{p_k}$, so that each layer has its own set of parameters associated with it. In our experiments, however, this made very little difference to the log likelihood. This would mainly be useful for very large networks.

The GMM is a special case of the deep GMM having only one layer. Moreover, each deep GMM can be constructed by a GMM with $\prod_{j=1}^{k} N_j$ components, where every path in the network represents one component in the GMM. The parameters of these components are tied to each other in the way the deep GMM is defined. Because of this tying, the number of parameters to train is proportional to $\sum_{j=1}^{k} N_j$. Still, the density estimator is quite expressive as it can represent a large number of Gaussian mixture components.
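To make the path construction concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the factorized path prior mentioned above (one probability vector per layer). The names `path_params` and `sample`, the random parameters and the toy layer sizes (3, 2, 3, as in Figure 1(c)) are illustrative choices. The sketch computes the mean and linear map of one path and draws a sample by pushing a standard normal vector through a randomly chosen path.

```python
import numpy as np

def path_params(A, b, path):
    """Return (mu_p, Omega_p) of Eqs. 2 and 3 for one path.

    A[j][c] and b[j][c] are the matrix and bias of component c in layer j
    (0-based; layer 0 is applied to z first, the last layer outputs x).
    """
    n = A[0][0].shape[0]
    mu, omega = np.zeros(n), np.eye(n)
    for j, c in enumerate(path):
        mu = A[j][c] @ mu + b[j][c]      # innermost term of Eq. 2 is b_{1,p_1}
        omega = A[j][c] @ omega          # builds A_k ... A_2 A_1
    return mu, omega

def sample(A, b, layer_probs, rng):
    """Draw one x: pick a component per layer, then transform z ~ N(0, I)."""
    path = tuple(int(rng.choice(len(p), p=p)) for p in layer_probs)
    x = rng.standard_normal(A[0][0].shape[0])
    for j, c in enumerate(path):
        x = A[j][c] @ x + b[j][c]
    return x, path

rng = np.random.default_rng(0)
n, sizes = 2, (3, 2, 3)                  # toy dimension and components per layer
A = [[np.eye(n) + 0.3 * rng.standard_normal((n, n)) for _ in range(N)] for N in sizes]
b = [[rng.standard_normal(n) for _ in range(N)] for N in sizes]
layer_probs = [np.full(N, 1.0 / N) for N in sizes]

num_paths = int(np.prod(sizes))          # components of the equivalent shallow GMM
num_transforms = sum(sizes)              # transformations the Deep GMM stores
x, path = sample(A, b, layer_probs, rng)
mu, omega = path_params(A, b, path)      # x is a draw from N(mu, omega @ omega.T)
```

For this toy network the equivalent shallow GMM would need 18 components, while the Deep GMM stores only 8 transformations.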
This is often the case with deep learning methods: Shallow architectures can often theoretically learn the same functions, but will require a much larger number of parameters [8]. When the kind of compound functions that a deep learning method is able to model are appropriate for the type of data, their performance will often be better than their shallow equivalents, because of the smaller risk of overfitting.

In the case of images, but also for other types of data, we can imagine why this network structure might be useful. A lot of images share the same variations such as rotations, translations, brightness changes, etc. These deformations can be represented by a linear transformation in the pixel space. When learning a deep GMM, the model may pick up on these variations in the data that are shared amongst images by factoring and describing them with the transformations in the network.

The hypothesis of this paper is that Deep GMMs overfit less than normal GMMs as the complexity of their density functions increases, because the parameter tying of the Deep GMM will force it to learn more useful functions. Note that this is one of the reasons why other deep learning methods are so successful. The only difference is that the parameter tying in deep GMMs is more explicit and interpretable.

A closely related method is the deep mixture of factor analyzers (DMFA) model [10], which is an extension of the Mixture of Factor Analyzers (MFA) model [11]. The DMFA model has a tree structure in which every node is a factor analyzer that inherits the low-dimensional latent factors from its parent. Training is performed layer by layer, where the dataset is hierarchically clustered and the children of each node are trained as a MFA on a different subset of the data using the MFA EM algorithm. The parent nodes are kept constant when training their children. The main difference with the proposed method is that in the Deep GMM the nodes of each layer are connected to all nodes of the layer above. The layers are trained jointly and the higher level nodes will adapt to the lower level nodes.

3 Training deep GMMs with EM

The algorithm we propose for training Deep GMMs is based on Expectation Maximization (EM). The optimization is similar to that of a GMM: in the E-step we will compute the posterior probabilities $\gamma_{np}$ that a path $p$ was responsible for generating $x_n$, also called the responsibilities. In the maximization step, the parameters of the model will be optimized given those responsibilities.

3.1 Expectation

From Equation 1 we get the log-likelihood given the data:

$$\sum_n \log p(x_n) = \sum_n \log \Big[ \sum_{p \in \Phi} \pi_p \, \mathcal{N}(x_n \mid \mu_p, \Omega_p \Omega_p^T) \Big].$$

This is the global objective for the Deep GMM to optimize. When taking the derivative with respect to a parameter $\theta$ we get:

$$\nabla_\theta \sum_n \log p(x_n) = \sum_{n,p} \frac{\pi_p \, \mathcal{N}(x_n \mid \mu_p, \Omega_p \Omega_p^T) \, \nabla_\theta \log \mathcal{N}(x_n \mid \mu_p, \Omega_p \Omega_p^T)}{\sum_{q} \pi_q \, \mathcal{N}(x_n \mid \mu_q, \Omega_q \Omega_q^T)} = \sum_{n,p} \gamma_{np} \, \nabla_\theta \log \mathcal{N}(x_n \mid \mu_p, \Omega_p \Omega_p^T),$$

with

$$\gamma_{np} = \frac{\pi_p \, \mathcal{N}(x_n \mid \mu_p, \Omega_p \Omega_p^T)}{\sum_{q \in \Phi} \pi_q \, \mathcal{N}(x_n \mid \mu_q, \Omega_q \Omega_q^T)}$$

the equation for the responsibilities. Although $\gamma_{np}$ generally depends on the parameter $\theta$, in the EM algorithm the responsibilities are assumed to remain constant when optimizing the model parameters in the M-step.

The E-step is very similar to that of a standard GMM, but instead of computing the responsibilities $\gamma_{nk}$ for every component $k$, one needs to compute them for every path $p = (p_1, p_2, \ldots, p_k) \in \Phi$. This is because every path represents a Gaussian mixture component in the equivalent shallow GMM. Because $\gamma_{np}$ needs to be computed for each datapoint independently, the E-step is very easy to parallelize. Often a simple way to increase the speed of convergence and to reduce computation time is to use an EM-variant with hard assignments.
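The responsibilities can be computed by brute-force enumeration of all paths when the network is small. The sketch below is an illustrative NumPy version under that assumption, not the authors' code. The helper names `log_gauss_path` and `responsibilities`, the uniform path prior and the toy sizes are hypothetical. Each path density is evaluated by inverting the transformations on the path, and the normalization is done in log space for numerical stability.

```python
import itertools
import numpy as np

def log_gauss_path(X, A, b, path):
    """log N(x_n | mu_p, Omega_p Omega_p^T) for every row x_n of X."""
    U, logdet = X, 0.0
    # Undo the transformations, starting from the layer closest to x.
    for j in reversed(range(len(path))):
        Aj, bj = A[j][path[j]], b[j][path[j]]
        U = np.linalg.solve(Aj, (U - bj).T).T
        logdet += np.linalg.slogdet(Aj)[1]          # log |det A_j|
    n = X.shape[1]
    return -0.5 * n * np.log(2 * np.pi) - logdet - 0.5 * np.sum(U * U, axis=1)

def responsibilities(X, A, b, log_pi):
    """Return all paths and gamma[n, p]; log_pi[p] is the log prior of path p."""
    paths = list(itertools.product(*(range(len(layer)) for layer in A)))
    log_joint = np.stack(
        [log_pi[i] + log_gauss_path(X, A, b, p) for i, p in enumerate(paths)], axis=1)
    log_joint -= log_joint.max(axis=1, keepdims=True)   # stabilize before exp
    gamma = np.exp(log_joint)
    gamma /= gamma.sum(axis=1, keepdims=True)
    return paths, gamma

rng = np.random.default_rng(0)
n, sizes = 2, (2, 3)                                    # toy network with 6 paths
A = [[np.eye(n) + 0.3 * rng.standard_normal((n, n)) for _ in range(N)] for N in sizes]
b = [[rng.standard_normal(n) for _ in range(N)] for N in sizes]
log_pi = np.log(np.full(6, 1.0 / 6))                    # assumed uniform path prior
X = rng.standard_normal((5, n))

paths, gamma = responsibilities(X, A, b, log_pi)
hard = gamma.argmax(axis=1)    # most responsible path per point (hard assignment)
```

Each row of `gamma` is computed independently of the others, which is what makes this step straightforward to split across workers.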
In this hard-assignment variant, only one of the responsibilities of each datapoint is set to 1:

$$\gamma_{np} = \begin{cases} 1 & \text{if } p = \arg\max_{q} \, \pi_q \, \mathcal{N}(x_n \mid \mu_q, \Omega_q \Omega_q^T) \\ 0 & \text{otherwise} \end{cases} \tag{4}$$

Heuristic

Because the number of paths is the product of the number of components per layer ($\prod_{j=1}^{k} N_j$), computing the responsibilities can become intractable for big Deep GMM networks. However, when using the hard-EM variant (Equation 4), this problem reduces to finding the best path for each datapoint, for which we can use efficient heuristics. Here we introduce such a heuristic that does not hurt the performance significantly, while allowing us to train much larger networks.

We optimize the path $p = (p_1, p_2, \ldots, p_k)$, which is a multivariate discrete variable, with a coordinate ascent algorithm. This means we change the parameters $p_i$ layer per layer, while keeping the parameter values of the other layers constant. After we have changed all the variables one time (one pass), we can repeat. The heuristic described above only requires $\sum_{j=1}^{k} N_j$ path evaluations per pass. In Figure 2 we compare the heuristic with the full search. On the left we see that after 3 passes the heuristic converges to a local optimum. In the middle we see that when repeating the heuristic algorithm a couple of times with different random initializations, and keeping the best path after each iteration, the log-likelihood converges to the optimum.

Figure 2: Visualizations for the introduced E-step heuristic. Panels: (a) Iterations, (b) Reinitializations, (c) Switch rate during training. (a): The average log-likelihood of the best-path search with the heuristic as a function of the number of iterations (passes) and (b): as a function of the number of repeats with a different initialization. Plot (c) shows the percentage of data points that switch to a better path found with a different initialization as a function of the number of EM-iterations during training.

In our experiments we initialized the heuristic with the optimal path from the previous E-step (warm start) and performed the heuristic algorithm for 1 pass. Subsequently we ran the algorithm for a second time with a random initialization for two passes for the possibility of finding a better optimum for each datapoint. Each E-step thus required $3 \sum_{j=1}^{k} N_j$ path evaluations. In Figure 2(c) we show an example of the percentage of data points (called the switch-rate) that had a better optimum with this second initialization for each EM-iteration. We can see from this Figure that the switch-rate quickly becomes very small, which means that using the responsibilities from the previous E-step is an efficient initialization for the current one. Although the number of path evaluations with the heuristic is substantially smaller than with the full search, we saw in our experiments that the performance of the resulting trained Deep GMMs was ultimately similar.
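The coordinate ascent search and the warm-start schedule described in this section can be sketched as follows. This is a hedged illustration rather than the authors' implementation. It assumes a factorized path prior (`log_pi[j][c]` per layer and component), and the function names `path_log_joint`, `coordinate_ascent` and `hard_e_step` as well as the toy network are hypothetical.

```python
import itertools
import numpy as np

def path_log_joint(x, A, b, log_pi, path):
    """log pi_p + log N(x | mu_p, Omega_p Omega_p^T) for a single path."""
    u = x
    score = -0.5 * x.shape[0] * np.log(2 * np.pi)
    # Undo the transformations from the layer closest to x back to the first.
    for j in reversed(range(len(path))):
        c = path[j]
        u = np.linalg.solve(A[j][c], u - b[j][c])
        score += log_pi[j][c] - np.linalg.slogdet(A[j][c])[1]
    return score - 0.5 * u @ u

def coordinate_ascent(x, A, b, log_pi, init, passes=1):
    """Improve one layer's choice at a time, keeping the other layers fixed."""
    path = list(init)
    best = path_log_joint(x, A, b, log_pi, path)
    for _ in range(passes):
        for j in range(len(path)):               # one layer at a time
            for c in range(len(A[j])):           # N_j candidates in this layer
                cand = path[:j] + [c] + path[j + 1:]
                s = path_log_joint(x, A, b, log_pi, cand)
                if s > best:
                    best, path = s, cand
    return tuple(path), best

def hard_e_step(x, A, b, log_pi, prev_path, rng):
    """One warm-started pass plus two passes from a random start; keep the best."""
    warm = coordinate_ascent(x, A, b, log_pi, prev_path, passes=1)
    rand_init = [int(rng.integers(len(layer))) for layer in A]
    cold = coordinate_ascent(x, A, b, log_pi, rand_init, passes=2)
    return max(warm, cold, key=lambda r: r[1])

rng = np.random.default_rng(1)
n, sizes = 3, (3, 2, 3)
A = [[np.eye(n) + 0.3 * rng.standard_normal((n, n)) for _ in range(N)] for N in sizes]
b = [[rng.standard_normal(n) for _ in range(N)] for N in sizes]
log_pi = [np.log(np.full(N, 1.0 / N)) for N in sizes]
x = rng.standard_normal(n)

heur_path, heur_score = hard_e_step(x, A, b, log_pi, prev_path=(0, 0, 0), rng=rng)
# Exhaustive search over all 18 paths, only feasible for small networks.
full_score = max(path_log_joint(x, A, b, log_pi, p)
                 for p in itertools.product(*(range(N) for N in sizes)))
```

The heuristic score can never exceed the exhaustive one. Whether it reaches it depends on the data point and the starting path, which is the trade-off that Figure 2 examines.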
3.2 Maximization

In the maximization step, the parameters are updated to maximize the log likelihood of the data, given the responsibilities. Although standard optimization techniques for training deep networks can be used (such as SGD), Deep GMMs have some interesting properties that allow us to train them more efficiently. Because these properties are not obvious at first sight, we will derive the objective and gradient for the transformation matrices $A_{i,j}$ in a Deep GMM. After that we will discuss various ways for optimizing them. For convenience, the derivations in this section are based on the hard-EM variant and with omission of the bias-term parameters. Equations without these simplifications can be obtained in a similar manner.

In the hard-EM variant, it is assumed that each datapoint in the dataset was generated by a path $p$, for which $\gamma_{n,p} = 1$. The likelihood of $x$ given the parameters of the transformations on this path is

$$p(x) = \big|A^{-1}_{1,p_1}\big| \cdots \big|A^{-1}_{k,p_k}\big| \; \mathcal{N}\big(A^{-1}_{1,p_1} \cdots A^{-1}_{k,p_k} x \mid 0, I_n\big), \tag{5}$$

where we use $|\cdot|$ to denote the absolute value of the determinant. Now let's rewrite:

$$z = A^{-1}_{i+1,p_{i+1}} \cdots A^{-1}_{k,p_k} x \tag{6}$$

$$Q = A^{-1}_{i,p_i} \tag{7}$$

$$R_\phi = A^{-1}_{1,p_1} \cdots A^{-1}_{i-1,p_{i-1}}, \tag{8}$$

so that we get (omitting the constant term w.r.t. $Q$):

$$\log p(x) \propto \log|Q| + \log \mathcal{N}(R_\phi Q z \mid 0, I_n). \tag{9}$$

Figure 3 gives a visual overview. We have folded the layers above the current layer into one. This means that each path $p$ through the network above the current layer is equivalent to a transformation $R_\phi$ in the folded version. The transformation matrix for which we will derive the objective and gradient is called $Q$. The average log-likelihood of all the data points that are generated by paths that pass through $Q$ is:

$$\frac{1}{N} \sum_i \log p(x_i) \propto \log|Q| + \frac{1}{N} \sum_{\phi} \sum_{i \in \phi} \log \mathcal{N}(R_\phi Q z_i \mid 0, I) \tag{10}$$

$$= \log|Q| - \frac{1}{2} \sum_{\phi} \pi_\phi \operatorname{Tr}\big[\Omega_\phi Q^T \Gamma_\phi Q\big], \tag{11}$$

where $\pi_\phi = \frac{N_\phi}{N}$, $\Omega_\phi = \frac{1}{N_\phi} \sum_{i \in \phi} z_i z_i^T$ and $\Gamma_\phi = R_\phi^T R_\phi$ (the matrix $\Omega_\phi$ is indexed by a folded path and should not be confused with the path transformation $\Omega_p$ of Equation 3). For the gradient we get:

$$\frac{1}{N} \nabla_Q \sum_i \log p(x_i) = Q^{-T} - \sum_{\phi} \pi_\phi \, \Gamma_\phi \, Q \, \Omega_\phi. \tag{12}$$

Figure 3: Optimization of a transformation $Q$ in a Deep GMM. The diagram shows $\mathcal{N}(0, I_n)$ at the top, followed by the transformations $R_1, R_2, \ldots, R_i, \ldots, R_m$ (the "folded" version of all the layers above the current layer), then $Q$ in the current layer, and $z$ below it. We can rewrite all the possible paths in the above layers by folding them into one layer, which is convenient for deriving the objective and gradient equations of $Q$.

Optimization

Notice how in Equation 11 the summation over the data points has been converted to a summation over covariance matrices: one for each path (see footnote 1). If the number of paths is small enough, this means we can use full gradient updates instead of mini-batched updates (e.g. SGD). The computation of the covariance matrices is fairly efficient and can be done in parallel. This formulation also allows us to use more advanced optimization methods, such as L-BFGS-B [12].

In the setup described above, we need to keep the transformation $R_\phi$ constant while optimizing $Q$. This is why in each M-step the Deep GMM is optimized layer-wise from top to bottom, updating one layer at a time. It is possible to go over this process multiple times for each M-step. Important to note is that this way the optimization of $Q$ does not depend on any other parameters in the same layer. So for each layer, the optimization of the different nodes can be done in parallel on multiple cores or machines.
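The objective of Equation 11 and the gradient of Equation 12 can be checked numerically. The following NumPy sketch is an illustration under stated assumptions, not the authors' code. The folded transformations, the per-path data and the matrix $Q$ are random toy values. The function names `objective` and `gradient` are hypothetical, and `omegas` and `gammas` hold the per-path second-moment matrices and the products $R_\phi^T R_\phi$. The analytic gradient is compared with central finite differences.

```python
import numpy as np

def objective(Q, pis, omegas, gammas):
    """Eq. 11: log|det Q| - 1/2 * sum_phi pi_phi * Tr[Omega_phi Q^T Gamma_phi Q]."""
    quad = sum(p * np.trace(Om @ Q.T @ Ga @ Q)
               for p, Om, Ga in zip(pis, omegas, gammas))
    return np.linalg.slogdet(Q)[1] - 0.5 * quad

def gradient(Q, pis, omegas, gammas):
    """Eq. 12: Q^{-T} - sum_phi pi_phi * Gamma_phi Q Omega_phi."""
    pull = sum(p * Ga @ Q @ Om for p, Om, Ga in zip(pis, omegas, gammas))
    return np.linalg.inv(Q).T - pull

rng = np.random.default_rng(0)
n, m = 3, 4                                   # data dimension, number of folded paths
R = [np.eye(n) + 0.3 * rng.standard_normal((n, n)) for _ in range(m)]
Z = [rng.standard_normal((20 + 5 * k, n)) for k in range(m)]   # points per folded path
N = sum(len(z) for z in Z)
pis = [len(z) / N for z in Z]                 # pi_phi = N_phi / N
omegas = [z.T @ z / len(z) for z in Z]        # second moments of the z_i on each path
gammas = [r.T @ r for r in R]                 # R_phi^T R_phi
Q = np.eye(n) + 0.1 * rng.standard_normal((n, n))

# Central finite differences, one entry of Q at a time.
eps = 1e-6
numeric = np.zeros_like(Q)
for i in range(n):
    for j in range(n):
        E = np.zeros_like(Q)
        E[i, j] = eps
        numeric[i, j] = (objective(Q + E, pis, omegas, gammas)
                         - objective(Q - E, pis, omegas, gammas)) / (2 * eps)
max_err = np.abs(numeric - gradient(Q, pis, omegas, gammas)).max()
```

Only `pis`, `omegas` and `gammas` enter the optimization, so the data points themselves are needed just once per M-step to build these statistics.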
Nodes in the same layer also do not share data points when using the EM-variant with hard assignments. Another advantage is that this method is easy to control, as there are no learning rates or other optimization parameters to be tuned when using L-BFGS-B out of the box. A disadvantage is that one needs to sum over all possible paths above the current node in the gradient computation. For deeper networks, this may become problematic when optimizing the lower-level nodes.

Footnote 1: Actually we only need to sum over the number of possible transformations $R_\phi$ above the node $Q$.

Alternatively, one can also evaluate (11) using Kronecker products as

$$\frac{1}{N} \sum_i \log p(x_i) \propto \log|Q| - \frac{1}{2} \operatorname{vec}(Q)^T \Big\{ \sum_{\phi} \pi_\phi \, (\Omega_\phi \otimes \Gamma_\phi) \Big\} \operatorname{vec}(Q) \tag{13}$$

and Equation 12 as

$$\frac{1}{N} \nabla_Q \sum_i \log p(x_i) = Q^{-T} - \operatorname{mat}\Big( \Big\{ \sum_{\phi} \pi_\phi \, (\Omega_\phi \otimes \Gamma_\phi) \Big\} \operatorname{vec}(Q) \Big). \tag{14}$$

Here vec is the (column-stacking) vectorization operator and mat its inverse, $\Omega_\phi$ is the second-moment matrix of the $z_i$ assigned to folded path $\phi$, and $\Gamma_\phi = R_\phi^T R_\phi$, as in Equation 11. With these formulations we don't have to loop over the number of paths anymore during the optimization. This makes the inner optimization with L-BFGS-B even faster. We only have to construct $\sum_{\phi} \pi_\phi (\Omega_\phi \otimes \Gamma_\phi)$ once, which is also easy to parallelize. These equations thus allow us to train even bigger Deep GMM architectures. A disadvantage, however, is that it requires the dimensionality of the data to be small enough to efficiently construct the Kronecker products.

When the aforementioned formulations are intractable because there are too many layers in the Deep GMM and the data dimensionality is too high, we can also optimize the parameters using backpropagation with a minibatch algorithm, such as Stochastic Gradient Descent (SGD). This approach works for much deeper networks, because we don't need to sum over the number of paths. From Equation 9 we see that this is basically the same as minimizing the L2 norm of $R_\phi Q z$, with $\log|Q|$ as regularization term. Disadvantages include the use of learning rates and other parameters such as momentum, which requires more engineering and fine-tuning.

The most naive way to optimize the deep GMM with SGD is by simultaneously optimizing all parameters, as is common in neural networks. When doing this it is important that the parameters of all nodes are converged enough in each M-step, otherwise nodes that are not optimized enough may have very low responsibilities in the following E-step(s). This results in whole parts of the network becoming unused, which is the equivalent of empty clusters during GMM or k-means training. An alternative way of using SGD is again by optimizing the Deep GMM layer by layer. This has the advantage that we have more control over the optimization, which prevents the aforementioned problem of unused paths. But more importantly, we can now again parallelize over the number of nodes per layer.
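The Kronecker-product formulation can be verified against the trace formulation on toy data. The NumPy sketch below is an illustration only. It assumes column-stacking for vec, and all matrices are random stand-ins. `K` denotes the summed Kronecker matrix, and `omegas` and `gammas` play the roles of the per-path second-moment matrices and of $R_\phi^T R_\phi$.

```python
import numpy as np

def vec(M):
    """Stack the columns of M into one vector."""
    return M.reshape(-1, order="F")

def mat(v, n):
    """Inverse of vec for an n x n matrix."""
    return v.reshape(n, n, order="F")

rng = np.random.default_rng(0)
n, m = 3, 4                                   # data dimension, number of folded paths
pis = rng.dirichlet(np.ones(m))               # path weights that sum to one
omegas, gammas = [], []
for _ in range(m):
    z = rng.standard_normal((30, n))
    omegas.append(z.T @ z / 30)               # second-moment matrix of the z_i
    r = np.eye(n) + 0.3 * rng.standard_normal((n, n))
    gammas.append(r.T @ r)                    # R^T R for this folded path
Q = np.eye(n) + 0.1 * rng.standard_normal((n, n))

# Built once per M-step; an (n^2 x n^2) matrix that replaces the loop over paths.
K = sum(p * np.kron(Om, Ga) for p, Om, Ga in zip(pis, omegas, gammas))

logdet = np.linalg.slogdet(Q)[1]
obj_kron = logdet - 0.5 * vec(Q) @ K @ vec(Q)
grad_kron = np.linalg.inv(Q).T - mat(K @ vec(Q), n)

# Reference values from the per-path trace form.
obj_trace = logdet - 0.5 * sum(p * np.trace(Om @ Q.T @ Ga @ Q)
                               for p, Om, Ga in zip(pis, omegas, gammas))
grad_trace = np.linalg.inv(Q).T - sum(p * Ga @ Q @ Om
                                      for p, Om, Ga in zip(pis, omegas, gammas))
```

The size of `K` illustrates the dimensionality limit mentioned above: for 63-dimensional data it is a 3969 by 3969 matrix, roughly 126 MB in double precision.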
4 Experiments and Results

For our experiments we used the Berkeley Segmentation Dataset (BSDS300) [13], which is a commonly used benchmark for density modeling of image patches, and the tiny images dataset [14]. For BSDS300 we follow the same setup as Uria et al. [15], which is best practice for this dataset. 8 by 8 grayscale patches are drawn from images of the dataset. The train and test sets consist of 200 and 100 images respectively. Because each pixel is quantized, it can only contain integer values between 0 and 255. To make the integer pixel values continuous, uniform noise (between 0 and 1) is added. Afterwards, the images are divided by 256 so that the pixel values lie in the range [0, 1]. Next, the patches are preprocessed by removing the mean pixel value of every image patch. Because this reduces the implicit dimensionality of the data, the last pixel value is removed. This results in the data points having 63 dimensions. For the tiny images dataset we rescale the images to 8 by 8 and then follow the same setup. This way we also have low resolution image data to evaluate on.

In all the experiments described in this section, we used the following setup for training Deep GMMs. We used the hard-EM variant, with the aforementioned heuristic in the E-step. For each M-step we used L-BFGS-B for 1000 iterations by using equations (13) and (14) for the objective and gradient. The total number of iterations we used for EM was fixed to 100, although fewer iterations were usually sufficient. The only hyperparameters were the number of components for each layer, which were optimized on a validation set.

Because GMMs are in theory able to represent the same probability density functions as a Deep GMM, we first need to assess whether using multiple layers with a deep GMM improves performance. The results of a GMM (one layer) and Deep GMMs with two or three layers are given in Figure 4(a).
As we increase the complexity and number of parameters of the model by changing the number of components in the top layer, a plateau is reached and the models ultimately start overfitting. For the deep GMMs, the number of components in the other layers was kept constant (5 components). The Deep GMMs seem to generalize better. Although they have a similar number of parameters, they are able to model more complex relationships, without overfitting. We also tried this experiment on a more difficult dataset by using highly downscaled images from the tiny images dataset, see Figure 4(b). Because there are fewer correlations between the pixels of a downscaled image than between those of an image patch, the average log likelihood values are lower. Overall we can see that the Deep GMM performs well on both low and high resolution natural images.

Figure 4: Performance of the Deep GMM for different numbers of layers, and the GMM (one layer), on (a) BSDS300 and (b) Tiny Images. All models were trained on the same dataset of 500 thousand examples. For comparison we varied the number of components in the top layer.

Next we will compare the deep GMM with other published methods on this task. Results are shown in Table 1. The first method is the RNADE model, a new deep density estimation technique which is an extension of the NADE model for real valued data [16, 15]. EoRNADE, which stands for ensemble of RNADE models, is currently the state of the art. We also report the log-likelihood results of two mixture models: the GMM and the Student-t Mixture model, from [2]. Overall we see that the Deep GMM has a strong performance. It scores better than other single models (RNADE, STM), but not as well as the ensemble of RNADE models.

Model                    Average log likelihood
RNADE: 1hl               143.2
RNADE: 2hl               149.2
RNADE: 3hl               152.0
RNADE: 4hl               153.6
RNADE: 5hl               154.7
RNADE: 6hl               155.2
EoRNADE (6hl)            157.0
GMM                      153.7
STM                      155.3
Deep GMM - 3 layers      156.2

Table 1: Density estimation results on image patch modeling using the BSDS300 dataset. Higher log-likelihood values are better. "hl" stands for the number of hidden layers in the RNADE models.

5 Conclusion

In this work we introduced the deep Gaussian Mixture Model: a novel density estimation technique for modeling real valued data. We show that the Deep GMM is on par with the current state of the art in image patch modeling, and surpasses other mixture models. We conclude that the Deep GMM is a viable and scalable alternative for unsupervised learning.
The deep GMM tackles unsupervised learning from a different angle than other recent deep unsupervised learning techniques [17, 18, 19], which makes it very interesting for future research. In follow-up work, we would like to make Deep GMMs suitable for larger images and other high-dimensional data. Locally connected filters, such as convolutions, would be useful for this. We would also like to extend our method to modeling discrete data. Deep GMMs are currently only designed for continuous real-valued data, but our approach of reparametrizing the model into layers of successive transformations can also be applied to other types of mixture distributions. We would also like to compare this extension to other discrete density estimators such as Restricted Boltzmann Machines, Deep Belief Networks and the NADE model [15].

References

[1] Daniel Zoran and Yair Weiss. From learning models of natural image patches to whole image restoration. In International Conference on Computer Vision, 2011.
[2] Aäron van den Oord and Benjamin Schrauwen. The student-t mixture model as a natural image patch prior with application to image compression. Journal of Machine Learning Research, 2014.
[3] Yoshua Bengio, Holger Schwenk, Jean-Sébastien Senécal, Fréderic Morin, and Jean-Luc Gauvain. Neural probabilistic language models. In Innovations in Machine Learning. Springer, 2006.
[4] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In Proceedings of Workshop at ICLR, 2013.
[5] Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. One-shot learning by inverting a compositional causal process. In Advances in Neural Information Processing Systems, 2013.
[6] Razvan Pascanu, Çaglar Gülçehre, Kyunghyun Cho, and Yoshua Bengio. How to construct deep recurrent neural networks. In Proceedings of the International Conference on Learning Representations, 2013.
[7] Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.
[8] Yoshua Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1), 2009.
[9] Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. In Proceedings of the International Conference on Learning Representations, 2014.
[10] Yichuan Tang, Ruslan Salakhutdinov, and Geoffrey Hinton. Deep mixtures of factor analysers. In International Conference on Machine Learning, 2012.
[11] Zoubin Ghahramani and Geoffrey E Hinton. The EM algorithm for mixtures of factor analyzers. Technical report, University of Toronto, 1996.
[12] Richard H Byrd, Peihuang Lu, Jorge Nocedal, and Ciyou Zhu. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 1995.
[13] David Martin, Charless Fowlkes, Doron Tal, and Jitendra Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings of the International Conference on Computer Vision. IEEE, 2001.
[14] Antonio Torralba, Robert Fergus, and William T Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008.
[15] Benigno Uria, Iain Murray, and Hugo Larochelle. A deep and tractable density estimator. In Proceedings of the International Conference on Machine Learning, 2013.
[16] Benigno Uria, Iain Murray, and Hugo Larochelle. RNADE: The real-valued neural autoregressive density-estimator. In Advances in Neural Information Processing Systems, 2013.
[17] Karol Gregor, Andriy Mnih, and Daan Wierstra. Deep autoregressive networks. In International Conference on Machine Learning, 2013.
[18] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic back-propagation and variational inference in deep latent gaussian models. In International Conference on Machine Learning, 2014.
[19] Yoshua Bengio, Eric Thibodeau-Laufer, and Jason Yosinski. Deep generative stochastic networks trainable by backprop. In International Conference on Machine Learning, 2013.