On Optimization Methods for Deep Learning


Quoc V. Le, Jiquan Ngiam, Adam Coates, Abhik Lahiri, Bobby Prochnow, Andrew Y. Ng
Computer Science Department, Stanford University, Stanford, CA 94305, USA

Abstract

The predominant methodology for training deep learning models advocates the use of stochastic gradient descent methods (SGDs). Despite their ease of implementation, SGDs are difficult to tune and parallelize. These problems make it challenging to develop, debug and scale up deep learning algorithms with SGDs. In this paper, we show that more sophisticated off-the-shelf optimization methods such as Limited-memory BFGS (L-BFGS) and Conjugate Gradient (CG) with line search can significantly simplify and speed up the process of pretraining deep algorithms. In our experiments, the difference between L-BFGS/CG and SGDs is more pronounced if we consider algorithmic extensions (e.g., sparsity regularization) and hardware extensions (e.g., GPUs or computer clusters). Our experiments with distributed optimization support the use of L-BFGS with locally connected networks and convolutional neural networks. Using L-BFGS, our convolutional network model achieves 0.69% on the standard MNIST dataset. This is a state-of-the-art result on MNIST among algorithms that do not use distortions or pretraining.

1. Introduction

Stochastic Gradient Descent methods (SGDs) have been extensively employed in machine learning (Bottou, 1991; LeCun et al., 1998; Shalev-Shwartz et al., 2007; Bottou & Bousquet, 2008; Zinkevich et al., 2010). A strength of SGDs is that they are simple to implement and also fast for problems that have many training examples.

(Appearing in Proceedings of the 28th International Conference on Machine Learning, Bellevue, WA, USA, 2011. Copyright 2011 by the author(s)/owner(s).)

However, SGD methods have many disadvantages. One key disadvantage of SGDs is that they require much manual tuning of optimization parameters such as learning rates and convergence criteria. If one does not know the task at hand well, it is very difficult to find a good learning rate or a good convergence criterion. A standard strategy in this case is to run the learning algorithm with many optimization parameters and pick the model that gives the best performance on a validation set. Since one needs to search over the large space of possible optimization parameters, this makes SGDs difficult to train in settings where running the optimization procedure many times is computationally expensive. The second weakness of SGDs is that they are inherently sequential: it is very difficult to parallelize them using GPUs or distribute them using computer clusters.

Batch methods, such as Limited-memory BFGS (L-BFGS) or Conjugate Gradient (CG), with the presence of a line search procedure, are usually much more stable to train and easier to check for convergence. These methods also enjoy parallelism by computing the gradient on GPUs (Raina et al., 2009) and/or distributing that computation across machines (Chu et al., 2007). These methods, conventionally considered to be slow, can be fast thanks to the availability of large amounts of RAM, multicore CPUs, GPUs and computer clusters with fast network hardware. On a single machine, the speed benefits of L-BFGS come from using approximated second-order information (modelling the interactions between variables). For CG, the benefits come from using conjugacy information during optimization. Thanks to these bookkeeping steps, L-BFGS and CG can be faster and more stable than SGDs.

A weakness of batch L-BFGS and CG, which require the computation of the gradient on the entire dataset to make an update, is that they do not scale gracefully with the number of examples. We found that minibatch training, which requires the computation of the gradient on only a small subset of the dataset, addresses this weakness well: minibatch L-BFGS/CG are fast even when the dataset is large.

Our experimental results reflect the different strengths and weaknesses of the different optimization methods. Among the problems we considered, L-BFGS is highly competitive with, and sometimes superior to, SGDs/CG for low dimensional problems, especially convolutional models. For high dimensional problems, CG is more competitive and usually outperforms L-BFGS and SGDs. Additionally, using a large minibatch and line search with SGDs can improve performance.

More significant speed improvements of L-BFGS and CG over SGDs are observed in our experiments with sparse autoencoders. This is because having a larger minibatch makes the optimization problem easier for sparse autoencoders: in this case, the cost of estimating the second-order and conjugate information is small compared to the cost of computing the gradient. Furthermore, when training autoencoders, L-BFGS and CG can both be sped up significantly (about 2x) by simply performing the computations on GPUs. Conversely, only small speed improvements were observed when SGDs are used with GPUs on the same problem.

We also present results showing that Map-Reduce style optimization works well for L-BFGS when the model uses locally connected networks (Le et al., 2010) or convolutional neural networks (LeCun et al., 1998). Our experimental results show that the speed improvements are close to linear in the number of machines when locally connected networks and convolutional networks are used (up to the 8 machines considered in the experiments).

We applied our findings to train a convolutional network model (similar to Ranzato et al. (2007)) with L-BFGS on a GPU cluster and obtained 0.69% test set error. This is the state-of-the-art result on MNIST among algorithms that do not use pretraining or distortions. Batch optimization is also behind the success of feature learning algorithms that achieve state-of-the-art performance on a variety of object recognition problems (Le et al., 2010; Coates et al., 2011) and action recognition problems (Le et al., 2011).

2. Related work

Optimization research has a long history. Examples of successful unconstrained optimization methods include Newton-Raphson's method, BFGS methods, Conjugate Gradient methods and Stochastic Gradient Descent methods. These methods are usually associated with a line search method to ensure that the algorithms consistently improve the objective function.

When it comes to large scale machine learning, the favorite optimization method is usually SGDs. Recent work on SGDs focuses on adaptive strategies for the learning rate (Shalev-Shwartz et al., 2007; Bartlett et al., 2008; Do et al., 2009) or on improving SGD convergence by approximating second-order information (Vishwanathan et al., 2007; Bordes et al., 2010). In practice, plain SGDs with constant learning rates, or with learning rates of the form α/(β + t), are still popular thanks to their ease of implementation.
These simple methods are even more common in deep learning (Hinton, 2010) because the optimization problems are nonconvex and the convergence properties of the more complex methods (Shalev-Shwartz et al., 2007; Bartlett et al., 2008; Do et al., 2009) no longer hold.

Recent proposals for training deep networks argue for the use of layerwise pretraining (Hinton et al., 2006; Bengio et al., 2007; Bengio, 2009). Optimization techniques for training these models include Contrastive Divergence (Hinton et al., 2006), Conjugate Gradient (Hinton & Salakhutdinov, 2006), stochastic diagonal Levenberg-Marquardt (LeCun et al., 1998) and Hessian-free optimization (Martens, 2010). Convolutional neural networks (LeCun et al., 1998) have traditionally employed SGDs with the stochastic diagonal Levenberg-Marquardt method, which uses a diagonal approximation to the Hessian (LeCun et al., 1998).

In this paper, our goal is to empirically study the pros and cons of off-the-shelf optimization algorithms in the context of unsupervised feature learning and deep learning. In that direction, we focus on comparing L-BFGS, CG and SGDs.

Parallel optimization methods have recently attracted attention as a way to scale up machine learning algorithms. Map-Reduce (Dean & Ghemawat, 2008) style optimization methods (Chu et al., 2007; Teo et al., 2007) have been successful early approaches. We also note recent studies (Mann et al., 2009; Zinkevich et al., 2010) that have parallelized SGDs without using the Map-Reduce framework. In our experiments, we found that if we use tiled (locally connected) networks (Le et al., 2010), which include convolutional architectures (LeCun et al., 1998; Lee et al., 2009a), Map-Reduce style parallelism is still an effective mechanism for scaling up. In such cases, the cost of communicating the parameters across the network is small relative to the cost of computing the objective function value and gradient.
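For reference, the plain SGD baseline referred to in this section, with momentum and a learning rate of the form α/(β + t) (the same parameters listed later in Section 4.2), can be written in a few lines. The sketch below is illustrative rather than the paper's implementation; grad_fn, the batches and the hyperparameter values are placeholders.

```python
import numpy as np

def sgd_train(grad_fn, theta, batches, alpha=0.1, beta=100.0, momentum=0.9, epochs=10):
    """Plain minibatch SGD with momentum and a learning rate of the form alpha / (beta + t).

    grad_fn(theta, batch) must return the gradient of the objective on that minibatch.
    alpha, beta, the momentum parameter and the minibatch size are exactly the
    hyperparameters the paper says must be tuned by hand.
    """
    velocity = np.zeros_like(theta)
    t = 0
    for _ in range(epochs):
        for batch in batches:
            t += 1
            lr = alpha / (beta + t)                      # decaying learning rate alpha/(beta + t)
            velocity = momentum * velocity - lr * grad_fn(theta, batch)
            theta = theta + velocity
    return theta
```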

3. Deep learning algorithms

3.1. Restricted Boltzmann Machines

In RBMs (Smolensky, 1986; Hinton et al., 2006), the gradient used in training is an approximation formed by taking a small number of Gibbs sampling steps (Contrastive Divergence). Given the biased nature of the gradient and the intractability of the objective function, it is difficult to use any optimization methods other than plain SGDs. For this reason we will not consider RBMs in our experiments.

3.2. Autoencoders and denoising autoencoders

Given an unlabelled dataset {x^{(i)}}, i = 1, ..., m, an autoencoder is a two-layer network that learns nonlinear codes to represent (or "reconstruct") the data. Specifically, we want to learn representations h(x^{(i)}; W, b) = σ(W x^{(i)} + b) such that σ(W^T h(x^{(i)}; W, b) + c) is approximately x^{(i)}:

  \min_{W,b,c} \sum_{i=1}^{m} \big\| \sigma\big(W^\top \sigma(W x^{(i)} + b) + c\big) - x^{(i)} \big\|_2^2    (1)

Here, we use the L2 norm to penalize the difference between the reconstruction and the input. In other studies, when x is binary, the cross entropy cost can also be used (Bengio et al., 2007). Typically, we set the activation function σ to be the sigmoid or hyperbolic tangent function.

Unlike RBMs, the gradient of the autoencoder objective can be computed exactly, and this gives rise to an opportunity to use more advanced optimization methods, such as L-BFGS and CG, to train the networks. Denoising autoencoders (Vincent et al., 2008) are also algorithms that can be trained by L-BFGS/CG.

3.3. Sparse RBMs and autoencoders

Sparsity regularization typically leads to more interpretable features that perform well for classification. Sparse coding was first proposed by Olshausen & Field (1996) as a model of simple cells in the visual cortex. Lee et al. (2007) and Raina et al. (2007) applied sparse coding to learn features for machine learning applications. Lee et al. (2008) combined sparsity and RBMs to learn representations that mimic certain properties of area V2 in the visual cortex. The key idea in their approach is to penalize the deviation between the expected value of the hidden representations E[h_j(x; W, b)] and a preferred target activation ρ. By setting ρ close to zero, the hidden units will be sparsely activated. Sparse representations have been employed successfully in many applications such as object recognition (Ranzato et al., 2007; Lee et al., 2009a; Nair & Hinton, 2009; Yang et al., 2009), speech recognition (Lee et al., 2009b) and activity recognition (Taylor et al., 2010; Le et al., 2011).

Training sparse RBMs is usually difficult, due to the stochastic nature of RBMs. Specifically, in stochastic mode, the estimate of the expectation E[h_j(x; W, b)] is very noisy. A common practice for training sparse RBMs is to use a running estimate of E[h_j(x; W, b)] and penalize only the bias (Lee et al., 2008; Larochelle & Bengio, 2008). This further complicates the optimization procedure and makes it hard to debug the learning algorithm. Moreover, it is important to tune the learning rates correctly for the different parameters W, b and c. Consequently, it can be difficult to train sparse RBMs.
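In contrast to the RBM case, the objective in Eq. (1) and its exact gradient are cheap to evaluate, which is what allows it to be handed directly to an off-the-shelf batch optimizer. Below is a minimal NumPy sketch of Eq. (1) for a tied-weight autoencoder, written as a function of a flat parameter vector, the form L-BFGS/CG routines expect; the helper names and shapes are illustrative and not the authors' implementation (in practice the gradient would be returned alongside the loss via backpropagation).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def unpack(theta, n_vis, n_hid):
    """Split a flat parameter vector into (W, b, c) for a tied-weight autoencoder."""
    W = theta[:n_hid * n_vis].reshape(n_hid, n_vis)
    b = theta[n_hid * n_vis:n_hid * n_vis + n_hid]
    c = theta[n_hid * n_vis + n_hid:]
    return W, b, c

def autoencoder_objective(theta, X, n_hid):
    """Reconstruction cost of Eq. (1): sum_i || sigma(W^T sigma(W x + b) + c) - x ||^2.

    X has one example per row; theta stacks W, b and c into one flat vector.
    """
    m, n_vis = X.shape
    W, b, c = unpack(theta, n_vis, n_hid)
    H = sigmoid(X @ W.T + b)            # hidden codes h(x; W, b)
    R = sigmoid(H @ W + c)              # reconstructions sigma(W^T h + c)
    return np.sum((R - X) ** 2)
```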
In our experience, it is often faster and simpler to obtain sparse representations via autoencoders with the sparsity penalties proposed below, especially when batch or large-minibatch optimization methods are used. In detail, we consider sparse autoencoders with a target activation of ρ and penalize the deviation from it using the KL divergence (Hinton, 2010):

  \sum_{j=1}^{n} D_{KL}\left( \rho \,\Big\|\, \frac{1}{m} \sum_{i=1}^{m} h_j(x^{(i)}; W, b) \right),    (2)

where m is the number of examples and n is the number of hidden units.

To train sparse autoencoders, we need to estimate the expected activation value for each hidden unit. However, we will not be able to compute this statistic unless we run the optimization method in batch mode. In practice, if we have a small dataset, it is better to use a batch method to train a sparse autoencoder because we do not have to tweak optimization parameters such as the minibatch size and the λ described below. Using a minibatch of size m' ≪ m, it is typical to keep a running estimate τ of the expectation E[h(x; W, b)]. In this case, the KL penalty is

  \sum_{j=1}^{n} D_{KL}\left( \rho \,\Big\|\, \lambda \frac{1}{m'} \sum_{i=1}^{m'} h_j(x^{(i)}; W, b) + (1 - \lambda)\tau_j \right),    (3)

where λ is another tunable parameter.
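To make Eqs. (2) and (3) concrete, here is a small NumPy sketch (illustrative only, not the paper's code) that evaluates the KL sparsity penalty on a minibatch of hidden activations while blending in the running estimate τ with weight λ.

```python
import numpy as np

def kl_bernoulli(rho, rho_hat, eps=1e-8):
    """KL divergence between Bernoulli(rho) and Bernoulli(rho_hat), elementwise."""
    rho_hat = np.clip(rho_hat, eps, 1.0 - eps)
    return rho * np.log(rho / rho_hat) + (1 - rho) * np.log((1 - rho) / (1 - rho_hat))

def sparsity_penalty(H_batch, tau, rho=0.1, lam=1.0):
    """KL sparsity penalty of Eq. (3) plus the updated running estimate.

    H_batch : (m', n) hidden activations h_j(x^(i); W, b) on the current minibatch.
    tau     : (n,) running estimate of E[h_j].
    """
    batch_mean = H_batch.mean(axis=0)                # (1/m') sum_i h_j(x^(i))
    rho_hat = lam * batch_mean + (1.0 - lam) * tau   # blended estimate used inside D_KL
    penalty = np.sum(kl_bernoulli(rho, rho_hat))     # sum over the n hidden units
    return penalty, rho_hat                          # rho_hat also serves as the new tau
```

Setting λ = 1 recovers the batch penalty of Eq. (2); Section 4.4 below sets λ = min{m'/1000, 1}, so minibatches of 1000 or more examples effectively ignore the running estimate.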

3.4. Tiled and locally connected networks

RBMs and autoencoders have densely-connected network architectures which do not scale well to large images. For large images, the most common approach is to use convolutional neural networks (LeCun et al., 1998; Lee et al., 2009a). Convolutional neural networks have local receptive field architectures: each hidden unit can only connect to a small region of the image. Translational invariance is usually hardwired by weight tying. Recent approaches relax this constraint in tiled convolutional architectures (Le et al., 2010) so that other invariances can also be learned (Goodfellow et al., 2010). Our experimental results show that local architectures, such as tiled convolutional or convolutional architectures, can be efficiently trained with a computer cluster using the Map-Reduce framework. With local architectures, the cost of communicating the gradient over the network is often smaller than the cost of computing it (e.g., in the cases considered in our experiments).

4. Experiments

4.1. Datasets and computers

Our experiments were carried out on the standard MNIST dataset. We used up to 8 machines for our experiments; each machine has 4 Intel CPU cores (at 2.67 GHz) and a GeForce GTX 285 GPU. Most experiments below were done on a single machine unless indicated with "parallel." We performed our experiments using Matlab and its GPU plugin Jacket. For parallel experiments, we used our custom toolbox that makes remote procedure calls in Matlab and Java.

In the experiments below, we report the standard metric in machine learning: the objective function evaluated on test data (i.e., test error) against time. We note that the objective function evaluated on the training data shows similar trends.

4.2. Optimization methods

We are interested in off-the-shelf SGDs, L-BFGS and CG. For SGDs, we used a learning rate schedule of α/(β + t), where t is the iteration number. In our experiments, we found that it is better to use this learning rate schedule than a constant learning rate. We also use momentum, and vary the number of examples used to compute the gradient. In summary, the optimization parameters associated with SGDs are: α, β, the momentum parameters (Hinton, 2010) and the number of examples in a minibatch.

We run L-BFGS and CG with a fixed minibatch for several iterations and then resample a new minibatch from the larger training set. For each new minibatch, we discard the cached optimization history in L-BFGS/CG. In our settings, for CG and L-BFGS, there are two optimization parameters: the minibatch size and the number of iterations per minibatch. We use the default values for other optimization parameters, such as line search parameters. For CG and L-BFGS, we replaced the minibatch after 3 iterations and 20 iterations respectively. We found that these parameters generally work very well for many problems. Therefore, the only remaining tunable parameter is the minibatch size.

4.3. Autoencoder training

We compare L-BFGS and CG against SGDs for training autoencoders. Our autoencoders have hidden units and the sigmoid activation function (σ). As a result, our model has approximately parameters, which is considered challenging for high order optimization methods like L-BFGS.

Figure 1. Autoencoder training with units on one machine.
(We used L-BFGS in minFunc by Mark Schmidt and a CG implementation from Carl Rasmussen. Both are fairly optimized implementations of these algorithms; less sophisticated implementations may perform worse. For lower dimensional problems, L-BFGS works much better than the other candidates; we omit these results due to space constraints.)
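The minibatch L-BFGS/CG scheme of Section 4.2, i.e., optimize on a fixed minibatch for a few iterations, then resample and discard the optimizer's history, can be sketched with an off-the-shelf routine such as SciPy's L-BFGS. This is an illustrative sketch, not the minFunc/CG setup used in the paper, and objective_and_grad is a placeholder for a function returning the loss and gradient on a minibatch (e.g., the autoencoder objective above).

```python
import numpy as np
from scipy.optimize import minimize

def minibatch_lbfgs(objective_and_grad, theta0, X, minibatch_size=1000,
                    iters_per_batch=20, n_batches=100, rng=None):
    """Run L-BFGS on a freshly sampled minibatch for a few iterations at a time.

    objective_and_grad(theta, X_batch) -> (loss, grad). Each call to `minimize` starts
    from scratch, which mirrors discarding the cached L-BFGS history whenever the
    minibatch is replaced (the paper uses 20 iterations per minibatch for L-BFGS
    and 3 for CG).
    """
    rng = rng or np.random.default_rng(0)
    theta = theta0
    for _ in range(n_batches):
        idx = rng.choice(X.shape[0], size=minibatch_size, replace=False)
        X_batch = X[idx]
        result = minimize(objective_and_grad, theta, args=(X_batch,), jac=True,
                          method="L-BFGS-B", options={"maxiter": iters_per_batch})
        theta = result.x
    return theta
```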

For L-BFGS, we vary the minibatch size in {1000, 10000}; for CG, we vary the minibatch size in {100, 10000}. For SGDs, we tried 20 combinations of optimization parameters, including varying the minibatch size in {1, 10, 100, 1000} (when the minibatch is large, this method is also called minibatch Gradient Descent). We compared the reconstruction errors on the test set for the different optimization methods and summarize the results in Figure 1. For SGDs, we only report the results for the two best parameter settings.

The results show that minibatch L-BFGS and CG with line search converge faster than carefully tuned plain SGDs. In particular, CG performs better than L-BFGS because computing the conjugate information can be less expensive than estimating the Hessian. CG also performs better than SGDs thanks to both the line search and the conjugate information. To understand how much estimating conjugacy helps CG, we also performed a control experiment where we tuned (increased) the minibatch size and added a line search procedure to SGDs.

Figure 2. Control experiment with line search for SGDs.

The results, shown in Figure 2, confirm that having a line search procedure makes SGDs simpler to tune and faster. Using information from the previous steps to form the Hessian approximation (L-BFGS) or conjugate directions (CG) further improves the results.

4.4. Sparse autoencoder training

In this experiment, we trained the autoencoders with the KL sparsity penalty. The target activation ρ is set to 10% (a typical value for sparse autoencoders or RBMs). The weighting between the estimate for the current sample and the old estimate (λ) is set to the ratio between the minibatch size m' and 1000, i.e., λ = min{m'/1000, 1}. This means that our estimates of the hidden unit activations are computed by averaging over at least about 1000 examples.

Figure 3. Sparse autoencoder training with units, ρ = 0.1, one machine.

We report the performance of the different methods in Figure 3. The results show that L-BFGS/CG are much faster than SGDs. The difference, however, is more significant than in the case of standard autoencoders. This is because L-BFGS and CG prefer larger minibatch sizes, and consequently it is easier to estimate the expected value of the hidden activations. In contrast, SGDs have to deal with a noisy estimate of the hidden activations, and we have to set the learning rate parameters to be small to make the algorithm more stable. Interestingly, the line search does not significantly improve SGDs, unlike in the previous experiment. A close inspection of the line search shows that the initial step sizes are chosen to be slightly smaller (more conservative) than the tuned step size.

4.5. Training autoencoders with GPUs

Figure 4. Autoencoder training with units, one machine with GPUs. L-BFGS and CG enjoy a speed up of approximately 2x, while no significant improvement is observed for plain SGDs.

The idea of using GPUs for training deep learning algorithms was first proposed in Raina et al. (2009). In this section, we consider GPUs and carry out experiments with standard autoencoders to understand how the different optimization algorithms perform. Using the same experimental protocols described in Section 4.3, we compared the optimization methods and their gains when switching from CPUs to GPUs, and present the results in Figure 4. From the figure, the speed-up gains are much higher for L-BFGS and CG than for SGDs. This is because L-BFGS and CG prefer larger minibatch sizes, which can be parallelized more efficiently on the GPUs.

4.6. Parallel training of dense networks

In this experiment, we explore optimization methods for training autoencoders in a distributed fashion using the Map-Reduce framework (Chu et al., 2007). (For the parallelized methods, one central machine, the "master", runs the optimization procedure while the slaves compute the objective values and gradients. At every step during optimization, the master sends the parameters to all slaves; the slaves then compute the objective function and gradient and send them back to the master.) We used the same settings for all algorithms as in Section 4.3. Our results with training dense autoencoders (omitted due to lack of space) show that parallelizing densely connected networks in this manner can result in slower convergence than running the method on a standalone machine. This can be attributed to the communication costs involved in passing the models and gradients across the network: the parameter vectors have a size of 64Mb, which is a considerable amount of network traffic since it is frequently communicated.

4.7. Parallel training of local networks

If we use tiled (locally connected) networks (Le et al., 2010), Map-Reduce style gradient computation can be used as an effective way of training. In tiled networks, the number of parameters is small and thus the cost of transferring the gradient across the network can often be smaller than the cost of computing it. Specifically, in this experiment, we constrain each hidden unit to connect to a small section of the image. Furthermore, we do not share any weights across hidden units (no weight tying constraints). We learn 20 feature maps, where each map consists of 441 filters, each of size 8x8.

Figure 5. Parallel training of locally connected networks. With locally connected networks, the communication cost is reduced significantly. The inset figure shows the (L-BFGS) speed improvement as a function of the number of slaves. The speed-up factor is measured by taking the amount of time required by each method to reach a test objective equal to or better than 2.

The results presented in Figure 5 show that SGDs are slower when a computer cluster is used. On the other hand, thanks to its preference for a larger minibatch size, L-BFGS enjoys a more significant speed improvement. The figure also shows that L-BFGS enjoys an almost linear speed up (up to the 8 slave machines considered in the experiments) when locally connected networks are used. On models where the number of parameters is small, L-BFGS's bookkeeping and communication costs are both small compared to the gradient computations (which are distributed across the machines).
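As a rough illustration of the master/slave scheme described in Section 4.6 (and not of the authors' Matlab/Java toolbox), the sketch below splits a minibatch across worker processes, has each worker return the objective and gradient on its shard, and sums the results on the master before handing them to a batch optimizer; compute_loss_grad uses a placeholder quadratic loss.

```python
import numpy as np
from multiprocessing import Pool
from scipy.optimize import minimize

def compute_loss_grad(args):
    """Worker-side computation: loss and gradient of the model on one data shard."""
    theta, X_shard = args
    # Placeholder objective: replace with the (sparse) autoencoder or CNN loss/gradient.
    loss = 0.5 * np.sum((X_shard @ theta) ** 2)
    grad = X_shard.T @ (X_shard @ theta)
    return loss, grad

def distributed_objective(theta, X, pool, n_slaves):
    """Master-side step: broadcast theta, gather per-shard (loss, grad), and sum them."""
    shards = np.array_split(X, n_slaves)
    results = pool.map(compute_loss_grad, [(theta, shard) for shard in shards])
    loss = sum(l for l, _ in results)
    grad = np.sum([g for _, g in results], axis=0)
    return loss, grad

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X, theta0, n_slaves = rng.standard_normal((4096, 50)), np.zeros(50), 4
    with Pool(n_slaves) as pool:
        result = minimize(distributed_objective, theta0, args=(X, pool, n_slaves),
                          jac=True, method="L-BFGS-B", options={"maxiter": 20})
    print(result.fun)
```

As the paper notes, this pattern pays off only when the gradient is expensive relative to the parameter traffic, which is why it helps for locally connected and convolutional models but not for dense autoencoders.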
4.8. Parallel training of supervised CNNs

In this experiment, we compare the different optimization methods for supervised training of two-layer convolutional neural networks (CNNs). Specifically, our model has 16 maps of 5x5 filters in the first layer, followed by (non-overlapping) pooling units that pool over a 3x3 region. The second layer has 16 maps of 4x4 filters, without any pooling units. Additionally, we have a softmax classification layer which is connected to all the output units from the second layer. In this experiment, we distribute the gradient computations across many machines with GPUs.

The experimental results (Figure 6) show that L-BFGS is better than CG and SGDs on this problem because of its low dimensionality (the number of parameters is relatively small). Map-Reduce style parallelism also significantly improves the performance of both L-BFGS and CG.

(In this experiment, we did not tune the minibatch size, i.e., when we have 4 slaves, the minibatch size per computer is divided by 4. We expect that tuning this minibatch size would improve the results as the number of computers goes up.)
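To make the architecture above concrete, the following sketch traces the feature-map sizes for a 28x28 MNIST input under the stated filter and pooling sizes; the input size, the "valid" convolutions and the non-overlapping pooling are our assumptions rather than details stated in the paper.

```python
def conv_out(size, filt):
    """Output side length of a 'valid' convolution with a filt x filt filter."""
    return size - filt + 1

def pool_out(size, pool):
    """Output side length of non-overlapping pool x pool pooling (assumes divisibility)."""
    return size // pool

side = 28                      # assumed MNIST input: 28 x 28
side = conv_out(side, 5)       # layer 1: 16 maps of 5x5 filters -> 24 x 24
side = pool_out(side, 3)       # 3x3 non-overlapping pooling     -> 8 x 8
side = conv_out(side, 4)       # layer 2: 16 maps of 4x4 filters -> 5 x 5
features = 16 * side * side    # all layer-2 outputs feed the softmax classifier
print(features)                # 400 inputs to the softmax layer
```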

Figure 6. Parallel training of CNNs.

4.9. Classification on standard MNIST

Finally, we carried out experiments to determine whether L-BFGS affects classification accuracy. We used a convolutional network with a first layer having 32 maps of 5x5 filters and 3x3 pooling with subsampling. The second layer had 64 maps of 5x5 filters and 2x2 pooling with subsampling. This architecture is similar to that described in Ranzato et al. (2007), with the following two differences: (i) we did not use an additional hidden layer of 200 hidden units; (ii) the receptive field of our first-layer pooling units is slightly larger (for computational reasons).

Table 1. Classification error on the MNIST test set for some representative methods without pretraining. SGDs with diagonal Levenberg-Marquardt are used in (LeCun et al., 1998; Ranzato et al., 2007).
  LeNet-5, SGDs, no distortions (LeCun et al., 1998): 0.95%
  LeNet-5, SGDs, huge distortions (LeCun et al., 1998): 0.85%
  LeNet-5, SGDs, distortions (LeCun et al., 1998): 0.80%
  ConvNet, SGDs, no distortions (Ranzato et al., 2007): 0.89%
  ConvNet, L-BFGS, no distortions (this paper): 0.69%

We trained our network using 4 machines (with GPUs). For every epoch, we saved the parameters to disk and used a hold-out validation set of examples to select the best model, which was then used to make predictions on the test set. (We used a reduced training set of examples throughout the classification experiments.) The results of our method (ConvNet) using minibatch L-BFGS are reported in Table 1.

The results show that the CNN, trained with L-BFGS, achieves an encouraging classification result: 0.69%. We note that this is the best result on MNIST among algorithms that do not use unsupervised pretraining or distortions. In particular, engineering distortions, typically viewed as a way to introduce domain knowledge, can improve classification results for MNIST. In fact, state-of-the-art results involve more careful distortion engineering and/or unsupervised pretraining, e.g., 0.4% (Simard et al., 2003), 0.53% (Jarrett et al., 2009), 0.39% (Ciresan et al., 2010).

5. Discussion

In our experiments, different optimization algorithms appear to be superior on different problems. Contrary to what appears to be a widely-held belief, namely that SGDs are almost always preferred, we found that L-BFGS and CG can be superior to SGDs in many cases. Among the problems we considered, L-BFGS is a good candidate for optimization of low dimensional problems, where the number of parameters is relatively small (e.g., convolutional neural networks). For high dimensional problems, CG often does well.

Sparsity provides another compelling case for using L-BFGS/CG. In our experiments, L-BFGS and CG outperform SGDs on training sparse autoencoders.

We note that there are cases where L-BFGS may not be expected to perform well (e.g., if the Hessian is not well approximated by a low-rank estimate). For instance, on local networks (Le et al., 2010) where the overlaps between receptive fields are small, the Hessian has a block-diagonal structure, and L-BFGS, which uses low-rank updates, may not perform well (personal communication with Will Zou). In such cases, algorithms that exploit the problem structure may perform much better. CG and L-BFGS are also methods that can take better advantage of GPUs thanks to their preference for larger minibatch sizes.
Furthermore, if one uses tiled (locally connected) networks, or other networks with a relatively small number of parameters, it is possible to compute the gradients in a Map-Reduce framework and speed up training with L-BFGS.

Acknowledgments: We thank Andrew Maas, Andrew Saxe, Quinn Slack, Alex Smola and Will Zou for comments and discussions. This work is supported by the DARPA Deep Learning program under contract number FA C.

References

Bartlett, P., Hazan, E., and Rakhlin, A. Adaptive online gradient descent. In NIPS.
Bengio, Y. Learning deep architectures for AI. Foundations and Trends in Machine Learning.
Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. Greedy layerwise training of deep networks. In NIPS.

Bordes, A., Bottou, L., and Gallinari, P. SGD-QN: Careful quasi-Newton stochastic gradient descent. JMLR.
Bottou, L. Stochastic gradient learning in neural networks. In Proceedings of Neuro-Nîmes 91.
Bottou, L. and Bousquet, O. The tradeoffs of large scale learning. In NIPS.
Chu, C. T., Kim, S. K., Lin, Y. A., Yu, Y. Y., Bradski, G., Ng, A. Y., and Olukotun, K. Map-Reduce for machine learning on multicore. In NIPS 19.
Ciresan, D. C., Meier, U., Gambardella, L. M., and Schmidhuber, J. Deep big simple neural nets excel on handwritten digit recognition. CoRR.
Coates, A., Lee, H., and Ng, A. Y. An analysis of single-layer networks in unsupervised feature learning. In AISTATS 14.
Dean, J. and Ghemawat, S. Map-Reduce: simplified data processing on large clusters. Comm. ACM.
Do, C. B., Le, Q. V., and Foo, C. S. Proximal regularization for online and batch learning. In ICML.
Goodfellow, I., Le, Q. V., Saxe, A., Lee, H., and Ng, A. Y. Measuring invariances in deep networks. In NIPS.
Hinton, G. A practical guide to training restricted Boltzmann machines. Technical report, U. of Toronto.
Hinton, G. E. and Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science.
Hinton, G. E., Osindero, S., and Teh, Y. W. A fast learning algorithm for deep belief nets. Neural Computation.
Jarrett, K., Kavukcuoglu, K., Ranzato, M. A., and LeCun, Y. What is the best multi-stage architecture for object recognition? In ICCV.
Larochelle, H. and Bengio, Y. Classification using discriminative restricted Boltzmann machines. In ICML.
Le, Q. V., Ngiam, J., Chen, Z., Chia, D., Koh, P. W., and Ng, A. Y. Tiled convolutional neural networks. In NIPS.
Le, Q. V., Zou, W., Yeung, S. Y., and Ng, A. Y. Learning hierarchical spatio-temporal features for action recognition with independent subspace analysis. In CVPR.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient based learning applied to document recognition. Proceedings of the IEEE.
LeCun, Y., Bottou, L., Orr, G., and Muller, K. Efficient backprop. In Neural Networks: Tricks of the Trade. Springer.
Lee, H., Battle, A., Raina, R., and Ng, A. Y. Efficient sparse coding algorithms. In NIPS.
Lee, H., Ekanadham, C., and Ng, A. Y. Sparse deep belief net model for visual area V2. In NIPS.
Lee, H., Grosse, R., Ranganath, R., and Ng, A. Y. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML, 2009a.
Lee, H., Largman, Y., Pham, P., and Ng, A. Y. Unsupervised feature learning for audio classification using convolutional deep belief networks. In NIPS, 2009b.
Mann, G., McDonald, R., Mohri, M., Silberman, N., and Walker, D. Efficient large-scale distributed training of conditional maximum entropy models. In NIPS.
Martens, J. Deep learning via Hessian-free optimization. In ICML.
Nair, V. and Hinton, G. E. 3D object recognition with deep belief nets. In NIPS.
Olshausen, B. and Field, D. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature.
Raina, R., Battle, A., Lee, H., Packer, B., and Ng, A. Y. Self-taught learning: Transfer learning from unlabelled data. In ICML.
Raina, R., Madhavan, A., and Ng, A. Y. Large-scale deep unsupervised learning using graphics processors. In ICML.
Ranzato, M., Huang, F. J., Boureau, Y., and LeCun, Y. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In CVPR.
Shalev-Shwartz, S., Singer, Y., and Srebro, N. Pegasos: Primal estimated sub-gradient solver for SVM. In ICML.
Simard, P., Steinkraus, D., and Platt, J. Best practices for convolutional neural networks applied to visual document analysis. In ICDAR.
Smolensky, P. Information processing in dynamical systems: foundations of harmony theory. In Parallel Distributed Processing.
Taylor, G. W., Fergus, R., LeCun, Y., and Bregler, C. Convolutional learning of spatio-temporal features. In ECCV.
Teo, C. H., Le, Q. V., Smola, A. J., and Vishwanathan, S. V. N. A scalable modular convex solver for regularized risk minimization. In KDD.
Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P. A. Extracting and composing robust features with denoising autoencoders. In ICML.
Vishwanathan, S. V. N., Schraudolph, N. N., Schmidt, M. W., and Murphy, K. P. Accelerated training of conditional random fields with stochastic gradient methods. In ICML.
Yang, J., Yu, K., Gong, Y., and Huang, T. Linear spatial pyramid matching using sparse coding for image classification. In CVPR.
Zinkevich, M., Weimer, M., Smola, A., and Li, L. Parallelized stochastic gradient descent. In NIPS, 2010.


More information

To Recognize Shapes, First Learn to Generate Images

To Recognize Shapes, First Learn to Generate Images Department of Computer Science 6 King s College Rd, Toronto University of Toronto M5S 3G4, Canada http://learning.cs.toronto.edu fax: +1 416 978 1455 Copyright c Geoffrey Hinton 2006. October 26, 2006

More information

Applying Deep Learning to Car Data Logging (CDL) and Driver Assessor (DA) October 22-Oct-15

Applying Deep Learning to Car Data Logging (CDL) and Driver Assessor (DA) October 22-Oct-15 Applying Deep Learning to Car Data Logging (CDL) and Driver Assessor (DA) October 22-Oct-15 GENIVI is a registered trademark of the GENIVI Alliance in the USA and other countries Copyright GENIVI Alliance

More information

Fast Matching of Binary Features

Fast Matching of Binary Features Fast Matching of Binary Features Marius Muja and David G. Lowe Laboratory for Computational Intelligence University of British Columbia, Vancouver, Canada {mariusm,lowe}@cs.ubc.ca Abstract There has been

More information

Taking Inverse Graphics Seriously

Taking Inverse Graphics Seriously CSC2535: 2013 Advanced Machine Learning Taking Inverse Graphics Seriously Geoffrey Hinton Department of Computer Science University of Toronto The representation used by the neural nets that work best

More information

large-scale machine learning revisited Léon Bottou Microsoft Research (NYC)

large-scale machine learning revisited Léon Bottou Microsoft Research (NYC) large-scale machine learning revisited Léon Bottou Microsoft Research (NYC) 1 three frequent ideas in machine learning. independent and identically distributed data This experimental paradigm has driven

More information

Advanced analytics at your hands

Advanced analytics at your hands 2.3 Advanced analytics at your hands Neural Designer is the most powerful predictive analytics software. It uses innovative neural networks techniques to provide data scientists with results in a way previously

More information

Lecture 8 February 4

Lecture 8 February 4 ICS273A: Machine Learning Winter 2008 Lecture 8 February 4 Scribe: Carlos Agell (Student) Lecturer: Deva Ramanan 8.1 Neural Nets 8.1.1 Logistic Regression Recall the logistic function: g(x) = 1 1 + e θt

More information

Methods and Applications for Distance Based ANN Training

Methods and Applications for Distance Based ANN Training Methods and Applications for Distance Based ANN Training Christoph Lassner, Rainer Lienhart Multimedia Computing and Computer Vision Lab Augsburg University, Universitätsstr. 6a, 86159 Augsburg, Germany

More information

Simple and efficient online algorithms for real world applications

Simple and efficient online algorithms for real world applications Simple and efficient online algorithms for real world applications Università degli Studi di Milano Milano, Italy Talk @ Centro de Visión por Computador Something about me PhD in Robotics at LIRA-Lab,

More information

What is the Best Multi-Stage Architecture for Object Recognition?

What is the Best Multi-Stage Architecture for Object Recognition? What is the Best Multi-Stage Architecture for Object Recognition? Kevin Jarrett, Koray Kavukcuoglu, Marc Aurelio Ranzato and Yann LeCun The Courant Institute of Mathematical Sciences New York University,

More information

TRAFFIC sign recognition has direct real-world applications

TRAFFIC sign recognition has direct real-world applications Traffic Sign Recognition with Multi-Scale Convolutional Networks Pierre Sermanet and Yann LeCun Courant Institute of Mathematical Sciences, New York University {sermanet,yann}@cs.nyu.edu Abstract We apply

More information

Generating more realistic images using gated MRF s

Generating more realistic images using gated MRF s Generating more realistic images using gated MRF s Marc Aurelio Ranzato Volodymyr Mnih Geoffrey E. Hinton Department of Computer Science University of Toronto {ranzato,vmnih,hinton}@cs.toronto.edu Abstract

More information

Deep Convolutional Inverse Graphics Network

Deep Convolutional Inverse Graphics Network Deep Convolutional Inverse Graphics Network Tejas D. Kulkarni* 1, William F. Whitney* 2, Pushmeet Kohli 3, Joshua B. Tenenbaum 4 1,2,4 Massachusetts Institute of Technology, Cambridge, USA 3 Microsoft

More information

Parallel & Distributed Optimization. Based on Mark Schmidt s slides

Parallel & Distributed Optimization. Based on Mark Schmidt s slides Parallel & Distributed Optimization Based on Mark Schmidt s slides Motivation behind using parallel & Distributed optimization Performance Computational throughput have increased exponentially in linear

More information

PhD in Computer Science and Engineering Bologna, April 2016. Machine Learning. Marco Lippi. marco.lippi3@unibo.it. Marco Lippi Machine Learning 1 / 80

PhD in Computer Science and Engineering Bologna, April 2016. Machine Learning. Marco Lippi. marco.lippi3@unibo.it. Marco Lippi Machine Learning 1 / 80 PhD in Computer Science and Engineering Bologna, April 2016 Machine Learning Marco Lippi marco.lippi3@unibo.it Marco Lippi Machine Learning 1 / 80 Recurrent Neural Networks Marco Lippi Machine Learning

More information

(Quasi-)Newton methods

(Quasi-)Newton methods (Quasi-)Newton methods 1 Introduction 1.1 Newton method Newton method is a method to find the zeros of a differentiable non-linear function g, x such that g(x) = 0, where g : R n R n. Given a starting

More information

Big Data in the Mathematical Sciences

Big Data in the Mathematical Sciences Big Data in the Mathematical Sciences Wednesday 13 November 2013 Sponsored by: Extract from Campus Map Note: Walk from Zeeman Building to Arts Centre approximately 5 minutes ZeemanBuilding BuildingNumber38

More information

Large Linear Classification When Data Cannot Fit In Memory

Large Linear Classification When Data Cannot Fit In Memory Large Linear Classification When Data Cannot Fit In Memory ABSTRACT Hsiang-Fu Yu Dept. of Computer Science National Taiwan University Taipei 106, Taiwan b93107@csie.ntu.edu.tw Kai-Wei Chang Dept. of Computer

More information

Big learning: challenges and opportunities

Big learning: challenges and opportunities Big learning: challenges and opportunities Francis Bach SIERRA Project-team, INRIA - Ecole Normale Supérieure December 2013 Omnipresent digital media Scientific context Big data Multimedia, sensors, indicators,

More information

Classification using Logistic Regression

Classification using Logistic Regression Classification using Logistic Regression Ingmar Schuster Patrick Jähnichen using slides by Andrew Ng Institut für Informatik This lecture covers Logistic regression hypothesis Decision Boundary Cost function

More information

Optimizing content delivery through machine learning. James Schneider Anton DeFrancesco

Optimizing content delivery through machine learning. James Schneider Anton DeFrancesco Optimizing content delivery through machine learning James Schneider Anton DeFrancesco Obligatory company slide Our Research Areas Machine learning The problem Prioritize import information in low bandwidth

More information

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set Amhmed A. Bhih School of Electrical and Electronic Engineering Princy Johnson School of Electrical and Electronic Engineering Martin

More information