IMPROVING DEEP NEURAL NETWORK ACOUSTIC MODELS USING GENERALIZED MAXOUT NETWORKS. Xiaohui Zhang, Jan Trmal, Daniel Povey, Sanjeev Khudanpur
Center for Language and Speech Processing & Human Language Technology Center of Excellence, The Johns Hopkins University, Baltimore, MD 21218, USA

ABSTRACT

Recently, maxout networks have brought significant improvements to various speech recognition and computer vision tasks. In this paper we introduce two new types of generalized maxout units, which we call p-norm and soft-maxout. We investigate their performance in Large Vocabulary Continuous Speech Recognition (LVCSR) tasks in various languages with 10 hours and 60 hours of data, and find that the p-norm generalization of maxout consistently performs well. Because, in our training setup, we sometimes see instability when training unbounded-output nonlinearities such as these, we also present a method to control that instability: the normalization layer, a nonlinearity that scales down all dimensions of its input in order to stop the average squared output from exceeding one. The performance of our proposed nonlinearities is compared with maxout, rectified linear units (ReLU), tanh units, and also with a discriminatively trained SGMM/HMM system, and our p-norm units with p equal to 2 are found to perform best.

Index Terms: Maxout Networks, Acoustic Modeling, Deep Learning, Speech Recognition

1. INTRODUCTION

Following the recent success of pre-trained deep neural networks based on sigmoidal units [1, 2, 3] and the popularity of deep learning, a number of different nonlinearities (activation functions) have been proposed for neural network training. A nonlinearity that has recently become popular is the Rectified Linear Unit (ReLU) [4], which is the simple activation function $y = \max(0, x)$. Significant performance gains are reported in [4] and [5], where ReLU networks, without pre-training, outperform standard sigmoidal networks. More recently, the maxout nonlinearity [6], which can be regarded as a generalization of ReLU, was proposed. This is the function $y = \max_i x_i$, which takes the maximum over a small group of inputs (of size, say, 3). Maxout networks, combined with dropout [7], have given state-of-the-art performance in various computer vision tasks [6], and have also achieved improvements in speech recognition tasks [8, 9].

In this paper, we present two dimension-reducing nonlinearities that are inspired by maxout. One is soft-maxout, formed by replacing the max function with $y = \log \sum_i \exp(x_i)$. The other is the p-norm, which is the nonlinearity $y = \|\mathbf{x}\|_p = (\sum_i |x_i|^p)^{1/p}$, where the vector $\mathbf{x}$ represents a small group of inputs (say, 5). Note: if all the $x_i$ were known to be positive, the original maxout would be equivalent to the p-norm with $p = \infty$.

Thanks to Shangsi Wang for useful discussions on the p-norm operation. The authors were supported by DARPA BOLT contract No HR C-0015, and IARPA BABEL contract No W911NF-12-C. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA, IARPA, DoD/ARL or the U.S. Government.
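To make the three group nonlinearities above concrete, the following is a minimal numpy sketch; the function and variable names are ours, not the paper's, and the actual implementation lives in Kaldi's C++ code:

```python
import numpy as np

def group_nonlinearity(x, group_size, kind="pnorm", p=2.0):
    """Dimension-reducing nonlinearity applied over consecutive input groups."""
    groups = x.reshape(-1, group_size)            # (num_groups, G)
    if kind == "maxout":                          # y = max_i x_i
        return groups.max(axis=1)
    if kind == "soft-maxout":                     # y = log sum_i exp(x_i)
        m = groups.max(axis=1)                    # subtract the max for stability
        return m + np.log(np.exp(groups - m[:, None]).sum(axis=1))
    if kind == "pnorm":                           # y = (sum_i |x_i|^p)^(1/p)
        return (np.abs(groups) ** p).sum(axis=1) ** (1.0 / p)
    raise ValueError("unknown nonlinearity: %s" % kind)

# 2500 activations with group size 5 reduce to 500 outputs, as in the text:
y = group_nonlinearity(np.random.randn(2500), group_size=5, kind="pnorm", p=2.0)
```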
It should also be noted that the p-norm pooling strategy has been used for learning image features [10, 11, 12], and a recent work [13] has proposed a learned-norm pooling strategy for deep feedforward and recurrent neural networks.

In Section 2 we describe our baseline DNN training recipe. In Section 3 we describe our proposed nonlinearities. In Section 4 we describe our experimental setup; and in Sections 5 and 6 we give experiments and conclusions.

2. OUR DNN RECIPE

In this section we explain key features of our baseline DNN training recipe. This recipe is part of the Kaldi ASR toolkit [14]. In order to avoid confusion we should explain that Kaldi currently contains two parallel implementations for DNN training. Both of these recipes train deep neural networks whose last (output) layer is a softmax layer with output dimension equal to the number of context-dependent states in the system (typically several thousand). The neural net is trained to predict the posterior probability of each context-dependent state. During decoding, the output probabilities are divided by the prior probability of each state to form a pseudo-likelihood that is used in place of the state emission probabilities in the HMM.

2.1. First Kaldi DNN implementation

The first implementation¹ is as described in [15]. This implementation supports Restricted Boltzmann Machine (RBM) pretraining [1, 2, 3], stochastic gradient descent training using NVidia graphics processing units (GPUs), and discriminative training such as boosted MMI [16] and state-level minimum Bayes risk (sMBR) [17, 18].

2.2. Second Kaldi DNN implementation

The work in this paper was done using the second implementation of DNNs in Kaldi². This recipe was originally written to support parallel training on multiple CPUs, although it has now been extended to support parallel GPU-based training. All our work here was performed on CPUs, in parallel. It does not support discriminative training.

¹ Location in code: src/{nnet,nnetbin}; in scripts: local/run_dnn.sh
² Location in code: src/{nnet-cpu,nnet-cpubin}/, soon to be moved to src/{nnet2,nnet2bin}/. Location in scripts: local/run_nnet_cpu.sh, soon to be moved to local/run_nnet2.sh
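Returning to the decoding step described at the start of this section, here is a minimal numpy sketch of the posterior-to-pseudo-likelihood conversion (illustrative only; the real computation happens inside Kaldi's decoder):

```python
import numpy as np

def pseudo_loglikelihoods(log_posteriors, log_priors):
    """Convert DNN state posteriors into HMM pseudo-likelihoods.

    log_posteriors: (num_frames, num_states) log p(state | frame) from the softmax.
    log_priors:     (num_states,) log p(state), e.g. estimated from training
                    alignments.
    In the log domain the division by the prior becomes a subtraction; the result
    stands in for the HMM state emission log-probabilities during decoding.
    """
    return log_posteriors - log_priors
```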
2.3. Greedy layer-wise supervised training

This recipe does not support Restricted Boltzmann Machine pretraining. Instead we do something similar to the greedy layer-wise supervised training of [19], or the layer-wise backpropagation of [3]. We initialize the network randomly with one hidden layer, train it for a short time (typically less than an epoch, meaning less than one full pass through the data), then remove the layer of weights that go to the softmax layer, add a new hidden layer and two sets of randomly initialized weights, and train again. This is repeated until we have the desired number of layers.

2.4. Use of multiple CPUs

Our parallelization of the neural network training has two levels: we parallelize on a single machine, and also across machines (by machine we mean a single physical server with shared memory and multiple CPUs). In a typical setup we run on 16 machines, each running 16 threads, so we have 256 CPUs running in total. The training time for 80 hours of data is about 48 hours.

The parallelization method on a single machine is to have multiple threads simultaneously updating the parameters while simply ignoring any synchronization issues. This is similar to the Hogwild! approach [20]. Another approach we considered was to use a fairly large minibatch size together with a multi-threaded implementation of BLAS; however, when we tried this we found the speedup was much less than linear in the number of CPUs used.

We will now explain the parallelization method we use across machines. It does not matter, for this method, whether the individual machines are using a single CPU or multiple CPUs. Our method is to have multiple machines train independently using SGD, on different random subsets of the data. After processing a specified amount of data per machine, which can take some minutes when using CPUs, each machine writes its model to disk and we average the model parameters. The averaged model parameters become the starting point for the next iteration of training. Because we average model parameters, rather than gradients as in the usual approaches [21], we only need to average the models quite infrequently. We have found that the lack of convexity of the neural network objective function is simply not an issue. However, for this method to make fast progress we must use a higher learning rate than we would use when training on a single machine, and this can sometimes lead to parameter divergence or saturation of the sigmoidal units. Our parallel model training tends to give a small degradation in WER when compared with training on a single machine, but we do it anyway because it is much faster.
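As a toy illustration of the cross-machine averaging step just described (a sketch under assumed conventions, not the Kaldi implementation, which uses its own model format), suppose each machine has dumped its parameters to disk as a dict of numpy arrays:

```python
import numpy as np

def average_models(model_paths):
    """Average the parameters of models trained independently by SGD.

    Assumes each path holds a dict mapping layer names to weight arrays,
    written with np.save(path, params) -- a hypothetical stand-in for
    Kaldi's binary model format, used purely for illustration.
    """
    models = [np.load(path, allow_pickle=True).item() for path in model_paths]
    return {name: sum(m[name] for m in models) / len(models)
            for name in models[0]}

# The average becomes the starting point for the next iteration of training:
# start = average_models(["machine%d.npy" % i for i in range(16)])
```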
2.5. Methods to stabilize training

We use a number of methods to help keep the training stable and to stop the neurons from becoming over-saturated when using sigmoidal units. We can only summarize them here.

2.5.1. Preconditioning the SGD

Preconditioned SGD means SGD in which, instead of a fixed scalar learning rate, we use a matrix-valued learning rate, where the matrix is symmetric positive definite. This matrix can either be fixed or can vary in some random way, although to be able to prove convergence it will generally be necessary to have upper and lower bounds on its eigenvalues that decrease in a suitable way during training. Also, this matrix should not depend on the current sample, or it may bias the direction taken by the learning algorithm.

There are reasons from information geometry to think that this matrix should be a multiple of the inverse of the Fisher information matrix [22]. However, this is a very high-dimensional matrix. Previous work has used a diagonal approximation [23]; our approach is a full-matrix approximation, but factored in a special way. Essentially, for the weights of each layer, we have one matrix-valued factor corresponding to the input dimension and one corresponding to the output dimension, which scale the input values and the output derivatives respectively. For each sample we train on, and for each of these factors, we derive the matrix from the inverse of the Fisher information matrix computed over the other members of the minibatch, smoothed with a multiple of the identity matrix. This can be implemented much more efficiently than one might think. We multiply the resulting matrices by a constant, computed for each (minibatch, layer) pair, chosen so that the total parameter change is roughly the same as it would have been without this method. We find that this method improves convergence as well as improving stability when learning rates are high.

2.5.2. Maximum change per minibatch

We enforce a maximum change in the parameters per minibatch. For each layer, we scale the parameter change to ensure that the sum, over the examples in the minibatch, of the 2-norm of the change in parameters due to each example is no more than a constant (e.g. 20). We formulate it in this slightly unintuitive way in order to avoid having to store a temporary matrix (see the sketch at the end of this subsection's group of methods).

2.5.3. Miscellaneous methods

In the following, note that an iteration of training corresponds to however long it takes for each of the (say) 16 machines to process a pre-specified number of samples. The number of epochs is fixed in advance, typically to around 20; this, together with the total amount of data and the number of machines used in parallel, determines the number of iterations, which can vary between about 50 and 500 depending on the amount of training data.

Learning rates: The initial and final learning rates in our training setup must be specified by hand, and during training we decrease them exponentially, except for a few epochs at the end (typically 5) during which we keep them constant. In [24] an exponentially decreasing schedule was also found to work best, although it is not supported by theory. We should note that whatever the specified global learning rate is, we apply half that learning rate to the last and second-to-last layers of weights. We experimented with various automatic ways to set the learning rates but found it very hard to do so robustly.

Parameter shrinking and fixing: These methods are only applicable when using sigmoid-type units. After each iteration we do either shrinking or fixing, on an alternating schedule. Shrinking means scaling the parameters of each layer by a separate factor, with the factors determined by nonlinear optimization over a subset of the training data (we found this worked better than using validation data). We call it shrinking because we expect the resulting scaling factors to generally be less than one (although this is not always the case). Fixing is an operation in which we check whether neurons are over-saturated, and scale down the parameters for a neuron if so. Over-saturated is defined as the derivative of the nonlinearity, averaged over training-data samples, being lower than a threshold.
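Returning to the maximum-change rule of Section 2.5.2, here is a small numpy sketch. It assumes plain SGD on a single weight matrix (ignoring the preconditioning above), in which case each example's contribution is the rank-1 update lr * g_i x_i^T, whose Frobenius norm factorizes as lr * ||g_i|| * ||x_i||; that factorization is what lets the bound be checked without storing a temporary per-example matrix:

```python
import numpy as np

def minibatch_change_norm(lr, inputs, output_derivs):
    """Sum over minibatch examples of the 2-norm of each example's weight change.

    Under plain SGD, example i changes the weight matrix by lr * g_i x_i^T,
    a rank-1 update with Frobenius norm lr * ||g_i|| * ||x_i||, so no
    per-example matrix ever needs to be materialized.
    """
    return lr * sum(np.linalg.norm(g) * np.linalg.norm(x)
                    for x, g in zip(inputs, output_derivs))

def limit_update(update, change_norm, max_change=20.0):
    """Scale this layer's minibatch update down if the summed per-example
    change norms exceed max_change (e.g. 20)."""
    scale = min(1.0, max_change / change_norm) if change_norm > 0 else 1.0
    return scale * update
```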
Mixing-up: In a method inspired by Gaussian mixture models, about halfway through training we increase the dimension of the final output layer (the softmax layer) by letting each output class's probability be a sum over potentially multiple mixture components. The mixture components are distributed using a power rule, proportional to the class priors. During mixing-up, we copy and then perturb the parameters that we had before mixing up, and then modify the bias term so that the total posterior probability assigned to each class is roughly the same as before. The average number of new mixtures we assign per state is configuration-dependent, but generally about two or three. The WER improvement from this is quite small, and not always consistent.

Model combination: After the final iteration of training, we take the models from the last n iterations and combine them via a weighted-average operation into a single model. The weights are determined via nonlinear optimization, optimizing the cross-entropy on a randomly selected subset of the training data (again, we found this to perform better than using the validation data). Each layer is weighted separately, so the number of parameters being optimized equals n times the number of layers. This gives a consistent improvement in WER.

3. MAXOUT NETWORKS AND OUR GENERALIZATIONS

In a maxout network, the nonlinearity is dimension-reducing. Suppose we have K maxout units (e.g. K = 500) with group size G (e.g. G = 5); then in this example the maxout nonlinearity reduces the dimension from 2500 to 500. For each group of 5 neurons, the output is the maximum of all the inputs:

y = \max_{i=1}^{G} x_i   (1)

Our generalizations are soft-maxout:

y = \log \sum_{i=1}^{G} \exp(x_i)   (2)

and the p-norm:

y = \|\mathbf{x}\|_p = \left( \sum_{i=1}^{G} |x_i|^p \right)^{1/p}.   (3)

The value of p is configurable; our experiments favor p = 2. Another nonlinearity we experiment with here is the ReLU nonlinearity:

y = \max(x, 0).   (4)

All of these nonlinearities have unbounded output, which can lead to instability in training. What we generally observe is that, after the objective function increases steadily for a number of epochs, it suddenly becomes very negative. Looking into the output of the parallel training runs on a particular iteration, we would often observe that the bad performance was limited to one or several of the runs. We tried a few methods to fix this, and the one we settled on is the following.

3.1. Normalization layers

A normalization layer is a nonlinearity that goes from some dimension K to K. We apply it directly after the nonlinearities discussed above, without a layer of weights in between. Let the input be x_i, with 1 ≤ i ≤ K. Compute

\sigma = \sqrt{ \tfrac{1}{K} \sum_i x_i^2 },

which is the uncentered standard deviation of the x_i. The nonlinearity is:

y_i = \begin{cases} x_i, & \sigma \le 1 \\ x_i / \sigma, & \sigma > 1 \end{cases}   (5)

That is, it scales down the whole set of activations if necessary to prevent the standard deviation from exceeding 1. Although this is intended to stabilize training, for consistency we apply it in both training and test conditions. We use normalization layers for all the unbounded-output nonlinearities (i.e. ReLU, and maxout and its variants).
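A minimal numpy sketch of this normalization layer's forward pass (names are ours; Kaldi's version is C++ and also implements the corresponding backprop):

```python
import numpy as np

def normalization_layer(x):
    """Forward pass of the normalization layer of Eq. (5).

    x: (batch, K) activations from an unbounded-output nonlinearity (ReLU,
    maxout, soft-maxout or p-norm). Each row is divided by its uncentered
    standard deviation sigma = sqrt(mean(x_i^2)) whenever sigma > 1, so the
    average squared output never exceeds one.
    """
    sigma = np.sqrt(np.mean(x ** 2, axis=1, keepdims=True))
    return x / np.maximum(sigma, 1.0)
```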
3.2. Previous related work

We should note that soft versions of ReLU have already been investigated [25, 5]. Besides the softplus function used in [25], a leaky version of ReLU, which allows a small non-zero derivative when the unit is inactive, was tried in [5]. Neither modification resulted in a performance improvement.

4. EXPERIMENTAL SETUP

We test on a number of languages. In each language there are two evaluation conditions: LimitedLP, with 10 hours of training data, and FullLP, with 60 or 80 hours of training data. We measure Word Error Rate (WER), for which lower is better, and Actual Term Weighted Value (ATWV), for which higher is better. Our baselines are our standard DNN system with tanh activations, and an SGMM [26] system trained with boosted MMI (bMMI) [16]. Both are trained on top of features adapted with Constrained MLLR/fMLLR [27].

[Fig. 1. Tuning the number of hidden layers (WER) for maxout and variants: Bengali LimitedLP.]
[Fig. 2. Effect of p and the group size for p-norm units.]
In this paper, we experiment with the Babel Assamese³, Bengali⁴, Haitian Creole⁵ and Zulu⁶ databases, mainly in the LimitedLP (10 hours of training speech) setup. We used the Bengali FullLP (60 hours of training speech) setup for experiments with the group size G for the maxout networks. The LimitedLP systems had around 2000 context-dependent states; the Bengali FullLP system had more. The base features used were PLP plus pitch and probability-of-voicing features based on SAcC [28], for all languages except Haitian Creole, where we also added FFV features [29]. Note: for evaluation purposes we used the performer-provided keyword releases for Assamese and Bengali. As these keyword sets weren't available at the time of writing this paper for Zulu and Haitian Creole, we generated our own development set of keywords for these two languages, using a criterion based on mutual information.

5. EXPERIMENTS

Figure 1 shows a tuning experiment where we varied the number of layers for a Bengali LimitedLP experiment. Here, we used a group size G = 10 and a number of groups K = 290 in each case (this was tuned to give about 3 million parameters in the 2-layer case). The optimal number of hidden layers seems to be lower for the 2-norm (at 3 layers) than for the other nonlinearities (at around 5).

The left plot in Figure 2 shows further tuning experiments, with the same experimental setup as Figure 1, with two hidden layers. Here we vary the value of p in the p-norm, and find that p = 2 works best. The right plot in the figure shows the effect of varying the group size G, using the same data, two layers, p = 2, and keeping the number of parameters fixed at 3 million. A group size G = 10 seems to work best. Note that in [8], a group size of G = 2 worked best, but in that case the number of parameters was being decreased as the group size was increased. We also tuned the initial and final learning rates (experiments not shown); the best values differed between the maxout variants, ReLU, and tanh units, and for tanh between LimitedLP and FullLP.

In Figures 3 and 4 we compare the various nonlinearity types across a number of languages, for the LimitedLP setting (10 hours of training data). All neural networks have 2 layers; unfortunately this is probably not the optimal setting. We used a group size G = 10 and kept the number of parameters for all neural nets roughly fixed at around 3 million. We also compare with the baseline SGMM+bMMI system. A fairly consistent trend can be seen, with the 2-norm giving the best performance. In Figure 5 we repeated the comparison with the FullLP Bengali data, this time using 4 layers and around 12 million parameters; here too the 2-norm performed best in WER.

[Fig. 3. Comparing nonlinearity types (%WER, LimitedLP).]
[Fig. 4. Comparing nonlinearity types (ATWV, LimitedLP).]
[Fig. 5. Comparing nonlinearity types (Bengali FullLP).]

6. CONCLUSIONS

In this paper, we proposed two new generalized versions of maxout nonlinearities, namely soft-maxout and p-norm units. In experiments on a number of languages, we find that p-norm units with p = 2 perform consistently better than the activations we used as baselines (various versions of maxout, and tanh and ReLU). The p-norm units also seem to perform better with a smaller number of parameters and layers than the other nonlinearities.

³ Language collection release IARPA-babel102b-v
⁴ Language collection release IARPA-babel103b-v
⁵ Language collection release IARPA-babel201b-v0.2b.
⁶ Language collection release IARPA-babel206b-v0.1d.
7. REFERENCES

[1] Geoffrey Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6.
[2] George E. Dahl, Dong Yu, Li Deng, and Alex Acero, "Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1.
[3] Frank Seide, Gang Li, and Dong Yu, "Conversational speech transcription using context-dependent deep neural networks," in INTERSPEECH, 2011.
[4] M. D. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q. V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, et al., "On rectified linear units for speech processing," in Proc. ICASSP.
[5] Andrew L. Maas, Awni Y. Hannun, and Andrew Y. Ng, "Rectifier nonlinearities improve neural network acoustic models," in ICML Workshop on Deep Learning for Audio, Speech, and Language Processing (WDLASL 2013).
[6] Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio, "Maxout networks," arXiv preprint.
[7] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov, "Improving neural networks by preventing co-adaptation of feature detectors," arXiv preprint.
[8] Y. Miao, S. Rawat, and F. Metze, "Deep maxout networks for low resource speech recognition," in Proc. ASRU.
[9] M. Cai, Y. Shi, and J. Liu, "Deep maxout neural networks for speech recognition," in Proc. ASRU.
[10] Koray Kavukcuoglu, M. Ranzato, Rob Fergus, and Yann LeCun, "Learning invariant features through topographic filter maps," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
[11] Y-Lan Boureau, Jean Ponce, and Yann LeCun, "A theoretical analysis of feature pooling in visual recognition," in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010.
[12] Pierre Sermanet, Soumith Chintala, and Yann LeCun, "Convolutional neural networks applied to house numbers digit classification," in 21st International Conference on Pattern Recognition (ICPR), 2012.
[13] Caglar Gulcehre, Kyunghyun Cho, Razvan Pascanu, and Yoshua Bengio, "Learned-norm pooling for deep neural networks," arXiv preprint.
[14] D. Povey, A. Ghoshal, et al., "The Kaldi Speech Recognition Toolkit," in Proc. ASRU.
[15] Karel Veselý, Arnab Ghoshal, Lukáš Burget, and Daniel Povey, "Sequence-discriminative training of deep neural networks," in Interspeech.
[16] D. Povey, D. Kanevsky, B. Kingsbury, B. Ramabhadran, G. Saon, and K. Visweswariah, "Boosted MMI for feature and model space discriminative training," in ICASSP.
[17] M. Gibson and T. Hain, "Hypothesis spaces for minimum Bayes risk training in large vocabulary speech recognition," in Interspeech.
[18] Daniel Povey and Brian Kingsbury, "Evaluation of proposed modifications to MPE for large scale discriminative training," in ICASSP.
[19] Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle, "Greedy layer-wise training of deep networks," in Advances in Neural Information Processing Systems (NIPS), vol. 19.
[20] Feng Niu, Benjamin Recht, Christopher Ré, and Stephen J. Wright, "Hogwild!: A lock-free approach to parallelizing stochastic gradient descent," arXiv preprint.
[21] Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc'Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, and Andrew Y. Ng, "Large scale distributed deep networks," in Neural Information Processing Systems (NIPS).
[22] Noboru Murata and Shun-ichi Amari, "Statistical analysis of learning dynamics," Signal Processing, vol. 74, no. 1.
[23] Elad Hazan, Alexander Rakhlin, and Peter L. Bartlett, "Adaptive online gradient descent," in Advances in Neural Information Processing Systems (NIPS), 2007.
[24] Andrew Senior, Georg Heigold, Marc'Aurelio Ranzato, and Ke Yang, "An empirical study of learning rates in deep neural networks for speech recognition," in ICASSP.
[25] Xavier Glorot, Antoine Bordes, and Yoshua Bengio, "Deep sparse rectifier networks," in Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), JMLR W&CP vol. 15, 2011.
[26] D. Povey, L. Burget, et al., "The Subspace Gaussian Mixture Model: a structured model for speech recognition," Computer Speech & Language, vol. 25, no. 2, April 2011.
[27] M. J. F. Gales and P. C. Woodland, "Mean and variance adaptation within the MLLR framework," Computer Speech and Language, vol. 10.
[28] Daniel P. W. Ellis and Byung Suk Lee, "Noise robust pitch tracking by subband autocorrelation classification," in 13th Annual Conference of the International Speech Communication Association (Interspeech).
[29] Kornel Laskowski, Mattias Heldner, and Jens Edlund, "The fundamental frequency variation spectrum," in FONETIK, 2008.