Supporting Online Material for

Reducing the Dimensionality of Data with Neural Networks

G. E. Hinton* and R. R. Salakhutdinov

*To whom correspondence should be addressed.

Published 28 July 2006, Science 313, 504 (2006)
DOI: /science

This PDF file includes: Materials and Methods, Figs. S1 to S5, Matlab Code
Supporting Online Material

Details of the pretraining: To speed up the pretraining of each RBM, we subdivided all datasets into mini-batches, each containing 100 data vectors, and updated the weights after each mini-batch. For datasets whose size is not divisible by the mini-batch size, the remaining data vectors were included in the last mini-batch. For all datasets, each hidden layer was pretrained for 50 passes through the entire training set. The weights were updated after each mini-batch using the averages in Eq. 1 of the paper with a small, fixed learning rate. In addition, a fixed fraction of the previous update (a momentum term) was added to each weight update, and a small fraction of the value of each weight was subtracted to penalize large weights. Weights were initialized with small random values sampled from a zero-mean normal distribution with a small standard deviation. The Matlab code we used is available at hinton/matlabforsciencepaper.html

Details of the fine-tuning: For the fine-tuning, we used the method of conjugate gradients on larger mini-batches containing 1000 data vectors. We used Carl Rasmussen's "minimize" code (1). Three line searches were performed for each mini-batch in each epoch. To determine an adequate number of epochs and to check for overfitting, we fine-tuned each autoencoder on a fraction of the training data and tested its performance on the remainder. We then repeated the fine-tuning on the entire training set. For the synthetic curves and handwritten digits, we used 200 epochs of fine-tuning; for the faces we used 20 epochs, and for the documents we used 50 epochs. Slight overfitting was observed for the faces, but there was no overfitting for the other datasets. Overfitting means that, towards the end of training, the reconstructions were still improving on the training set but were getting worse on the validation set.

We experimented with various values of the learning rate, momentum, and weight-decay parameters, and we also tried training the RBMs for more epochs. We did not observe any significant differences in the final results after the fine-tuning. This suggests that the precise weights found by the greedy pretraining do not matter, as long as the pretraining finds a good region of weight space from which to start the fine-tuning.

How the curves were generated: To generate the synthetic curves, we constrained the x coordinate of each point to exceed the x coordinate of the previous point by at least a fixed margin, and we constrained all coordinates to lie within a fixed range. The three points define a cubic spline which is "inked" to produce the pixel images shown in Fig. 2 of the paper. The details of the inking procedure are described in (2) and the Matlab code is available at (3).
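As an illustration of the pretraining procedure described above, the following Octave/Matlab sketch performs a single contrastive-divergence (CD-1) weight update for one mini-batch of binary data. It is a minimal sketch, not the code used for the paper: the array sizes, the stand-in data, and the numerical values of the learning rate, momentum, and weight cost are placeholder assumptions.

% Minimal CD-1 update for one mini-batch of binary data (sketch; hyperparameters are placeholders).
numdims = 784; numhid = 1000; numcases = 100;        % assumed layer sizes and mini-batch size
data = double(rand(numcases, numdims) > 0.5);        % stand-in mini-batch of binary data vectors
vishid    = 0.1 * randn(numdims, numhid);            % weights: small zero-mean Gaussian initialization
hidbiases = zeros(1, numhid);  visbiases = zeros(1, numdims);
vishidinc  = zeros(numdims, numhid);                 % previous updates (for the momentum term)
hidbiasinc = zeros(1, numhid); visbiasinc = zeros(1, numdims);
epsilon = 0.1; momentum = 0.9; weightcost = 0.0002;  % placeholder values, not the paper's settings

% Positive phase: hidden probabilities and stochastic binary hidden states
poshidprobs  = 1 ./ (1 + exp(-(data * vishid + repmat(hidbiases, numcases, 1))));
poshidstates = double(poshidprobs > rand(numcases, numhid));
posprods     = data' * poshidprobs;

% Negative phase: one-step confabulation of the visible units, then the hidden probabilities again
negdata     = 1 ./ (1 + exp(-(poshidstates * vishid' + repmat(visbiases, numcases, 1))));
neghidprobs = 1 ./ (1 + exp(-(negdata * vishid + repmat(hidbiases, numcases, 1))));
negprods    = negdata' * neghidprobs;

% Updates: difference of averages (Eq. 1 of the paper) plus momentum and weight decay
vishidinc  = momentum * vishidinc + epsilon * ((posprods - negprods) / numcases - weightcost * vishid);
visbiasinc = momentum * visbiasinc + (epsilon / numcases) * sum(data - negdata);
hidbiasinc = momentum * hidbiasinc + (epsilon / numcases) * sum(poshidprobs - neghidprobs);
vishid     = vishid + vishidinc;
visbiases  = visbiases + visbiasinc;
hidbiases  = hidbiases + hidbiasinc;

Looping this update over all mini-batches gives one pass through the training set; 50 such passes per layer correspond to the pretraining schedule described above.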
Fitting logistic PCA: To fit logistic PCA we used an autoencoder in which the linear code units were directly connected to both the inputs and the logistic output units, and we minimized the cross-entropy error using the method of conjugate gradients.

How pretraining affects fine-tuning in deep and shallow autoencoders: Figure S1 compares the performance of pretrained and randomly initialized autoencoders on the curves dataset. Figure S2 compares the performance of deep and shallow autoencoders that have the same number of parameters. For all of these comparisons, the weights were initialized with small random values sampled from a zero-mean normal distribution with a small standard deviation.

Details of finding codes for the MNIST digits: For the MNIST digits, the original pixel intensities were normalized to lie in the interval [0, 1]. They had a preponderance of extreme values and were therefore modeled much better by a logistic than by a Gaussian. The entire training procedure for the MNIST digits was identical to the training procedure for the curves, except that the training set had 60,000 images, of which 10,000 were used for validation. Figures S3 and S4 show an alternative way of visualizing the two-dimensional codes produced by PCA and by an autoencoder with only two code units. These alternative visualizations show many of the actual digit images. We obtained our results using an autoencoder with 1000 units in the first hidden layer. The fact that this is more than the number of pixels does not cause a problem for the RBM: it does not try to simply copy the pixels in the way that a one-hidden-layer autoencoder would. Subsequent experiments showed that if the 1000 units are reduced to 500, there is very little change in the performance of the autoencoder.

Details of finding codes for the Olivetti face patches: The Olivetti face dataset from which we obtained the face patches contains ten images of each of forty different people. We constructed a dataset of 165,600 images by rotating, scaling (by factors between 1.4 and 1.8), cropping, and subsampling the original 400 images. The intensities in the cropped images were shifted so that every pixel had zero mean, and the entire dataset was then scaled by a single number to fix the average pixel variance. The dataset was then subdivided into 124,200 training images, which contained the first thirty people, and 41,400 test images, which contained the remaining ten people. The training set was further split into 103,500 training and 20,700 validation images, containing disjoint sets of 25 and 5 people. When pretraining the first layer of 2000 binary features, each real-valued pixel intensity was modeled by a Gaussian distribution with unit variance. Pretraining this first layer of features required a much smaller learning rate to avoid oscillations, and it proceeded for 200 epochs. We used more feature detectors than pixels because a real-valued pixel intensity contains more information than a binary feature activation. Pretraining of the higher layers was carried out as for all the other datasets. The ability of the autoencoder to reconstruct more of the perceptually significant, high-frequency details of faces is not fully reflected in the squared pixel error. This is an example of the well-known inadequacy of squared pixel error for assessing perceptual similarity.
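To illustrate how the pretraining of the first layer changes when the visible units are linear with unit-variance Gaussian noise, as for the face patches, the following sketch shows the one-step confabulation; only the reconstruction of the visible units differs from the binary case. The array sizes, the stand-in data, and the variable names are our own assumptions, and the sketch uses the mean of the Gaussian (a mean-field reconstruction) rather than a noisy sample for simplicity.

% One CD-1 step with unit-variance Gaussian (linear) visible units (sketch).
numdims = 625; numhid = 2000; numcases = 100;   % assumed patch size, layer width, mini-batch size
data = randn(numcases, numdims);                % stand-in mini-batch of real-valued face patches
vishid = 0.1 * randn(numdims, numhid);
hidbiases = zeros(1, numhid); visbiases = zeros(1, numdims);

% The binary hidden units are driven by the real-valued pixels exactly as in the binary case
poshidprobs  = 1 ./ (1 + exp(-(data * vishid + repmat(hidbiases, numcases, 1))));
poshidstates = double(poshidprobs > rand(numcases, numhid));

% For unit-variance Gaussian visible units, the confabulation is drawn from a Gaussian whose mean
% is the visible bias plus the top-down input; here we simply use that mean.
negdata = poshidstates * vishid' + repmat(visbiases, numcases, 1);

neghidprobs = 1 ./ (1 + exp(-(negdata * vishid + repmat(hidbiases, numcases, 1))));
% The weight and bias updates are then formed from (data' * poshidprobs - negdata' * neghidprobs)
% exactly as in the binary case, but with a much smaller learning rate to avoid oscillations.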
Details of finding codes for the Reuters documents: The 804,414 newswire stories in the Reuters Corpus Volume II have been manually categorized into 103 topics. The corpus covers four major groups: corporate/industrial, economics, government/social, and markets. The labels were not used during either the pretraining or the fine-tuning. The data was randomly split into 402,207 training and 402,207 test stories, and the training set was further randomly split into 302,207 training and 100,000 validation documents. Common stopwords were removed from the documents and the remaining words were stemmed by removing common endings. During the pretraining, the confabulated activities of the visible units were computed using a softmax, which is the generalization of a logistic to more than two alternatives:

p_i = e^{x_i} / \sum_j e^{x_j}    (1)

where p_i is the modeled probability of word i and x_i is the weighted input produced by the feature activations plus a bias term. The use of a softmax does not affect the learning rule (4), but to allow for the fact that a document containing N words provides N observations from the probability distribution over words, the weight from each word to each feature was set to be N times the weight from the feature back to that word.

For our comparisons with Latent Semantic Analysis (LSA), we used the version of LSA in which each word count, c, is replaced by log(1 + c). This standard preprocessing trick slightly improves the retrieval performance of LSA by down-weighting the influence of very frequent words.
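As a concrete illustration of Eq. (1), the following sketch confabulates the word counts of a single document from a vector of binary feature activations. The vocabulary size, layer width, stand-in data, and variable names are illustrative assumptions, and the treatment of the N-fold up-weighting in the bottom-up pass is one possible reading of the description above rather than the paper's exact implementation.

% Softmax confabulation of word counts for one document (sketch; sizes and data are stand-ins).
numwords = 2000; numhid = 500;
counts = full(sparse(1, randi(numwords, 1, 120), 1, 1, numwords));  % stand-in word-count vector
N = sum(counts);                                                    % number of words in the document
wordhid = 0.01 * randn(numwords, numhid);                           % top-down (feature-to-word) weights
hidbiases = zeros(1, numhid); wordbiases = zeros(1, numwords);

% Bottom-up pass: driving the hidden units with the raw counts lets each of the N observed words
% contribute its weight, one reading of the N-fold up-weighting described in the text.
hidprobs  = 1 ./ (1 + exp(-(counts * wordhid + hidbiases)));
hidstates = double(hidprobs > rand(1, numhid));

% Top-down pass: Eq. (1) gives a probability for every word in the vocabulary.
x = hidstates * wordhid' + wordbiases;     % weighted input plus bias for each word
p = exp(x - max(x));                       % shift by max(x) for numerical stability
p = p / sum(p);                            % softmax probabilities, summing to 1
negcounts = N * p;                         % expected confabulated counts (N draws from the softmax)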
A comparison with Local Linear Embedding: It is hard to compare autoencoders with Local Linear Embedding (5) on a reconstruction task because, like several other non-parametric dimensionality reduction methods, LLE does not give a simple way of reconstructing a test image. We therefore compared LLE with autoencoders on a document retrieval task. LLE is not a sensible method to use for retrieval because it involves finding nearest neighbors in the high-dimensional space for each new query document, but document retrieval does, at least, give us a way of assessing how good the low-dimensional codes are. Fitting LLE takes time that is quadratic in the number of training cases, so we used the 20 newsgroups corpus, which has only 11,314 training and 7,531 test documents. The documents are postings taken from the Usenet newsgroup collection. The corpus is partitioned fairly evenly into 20 different newsgroups, each corresponding to a separate topic. The data was preprocessed, organized by date, and is available at (6). Because the data was split by date, the training and test sets were separated in time. Some newsgroups are very closely related to each other, e.g. soc.religion.christian and talk.religion.misc, while others are very different, e.g. rec.sport.hockey and comp.graphics. We removed common stopwords from the documents and also stemmed the remaining words by removing common endings. As for the Reuters corpus, we only considered the 2000 most frequent words in the training dataset. LLE must be given the K nearest neighbors of each datapoint, and we used the cosines of the angles between count vectors to find them. The performance of LLE depends on the value chosen for K, so we tried a range of values and we always report the results obtained with the best value of K. We also checked that K was large enough for the whole dataset to form one connected component (for K = 5, there was a disconnected component of 21 documents).

During the test phase, for each query document, we identify its K nearest count vectors in the training set and compute the weights that best reconstruct the query's count vector from those neighbors. We then use the same weights to generate the low-dimensional code for the query from the low-dimensional codes of its high-dimensional nearest neighbors (7), as in the sketch below. LLE code is available at (8). For 2-dimensional codes, LLE (with its best K) performs better than LSA but worse than our autoencoder. For higher-dimensional codes, the performance of LLE (with its best K) is very similar to the performance of LSA (see Fig. S5) and much worse than that of the autoencoder. We also tried normalizing the squared lengths of the document count vectors before applying LLE, but this did not help.
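A minimal sketch of this out-of-sample mapping, assuming the training count vectors Xtrain (one row per document), their LLE codes Ytrain, and a query count vector q; the function name, the cosine-based neighbor search, and the small regularizer added to the local covariance are our own choices rather than details taken from the LLE code cited above.

% lle_out_of_sample.m: map a query count vector to a low-dimensional code (sketch).
% q      : 1 x numwords count vector of the query document
% Xtrain : numtrain x numwords count vectors; Ytrain : numtrain x d LLE codes; K : #neighbors
function y = lle_out_of_sample(q, Xtrain, Ytrain, K)
  % Find the K training documents whose count vectors have the largest cosine with the query
  sims = (Xtrain * q') ./ (sqrt(sum(Xtrain .^ 2, 2)) * norm(q) + eps);
  [ignore, order] = sort(sims, 'descend');
  nbrs = order(1:K);

  % Solve for the weights that best reconstruct q from its neighbors, subject to sum(w) = 1
  Z = Xtrain(nbrs, :) - repmat(q, K, 1);   % neighbors shifted so the query is at the origin
  C = Z * Z';                              % K x K local covariance matrix
  C = C + eye(K) * (1e-3 * trace(C));      % small regularizer in case C is singular
  w = C \ ones(K, 1);
  w = w / sum(w);

  % The same weights combine the neighbors' low-dimensional codes to give the query's code
  y = w' * Ytrain(nbrs, :);
end

Repeating this mapping for every query document gives the codes used in the retrieval comparison of Fig. S5.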
Using the pretraining and fine-tuning for digit classification: To show that the same pretraining procedure can improve generalization on a classification task, we used the permutation-invariant version of the MNIST digit recognition task. Before being given to the learning program, all images undergo the same random permutation of the pixels. This prevents the learning program from using prior information about geometry, such as affine transformations of the images or local receptive fields with shared weights (9). On the permutation-invariant task, Support Vector Machines achieve a test error rate of 1.4% (10). The best published result for a randomly initialized neural net trained with backpropagation is 1.6% (11). Pretraining reduces overfitting and makes learning faster, so it makes it possible to use a much larger neural network that achieves 1.2%. We pretrained this larger net for 100 epochs on all 60,000 training cases in just the same way as the autoencoders were pretrained, but with 2000 logistic units in the top layer. The pretraining did not use any information about the class labels. We then connected ten softmaxed output units to the top layer and fine-tuned the whole network using simple gradient descent in the cross-entropy error, with a very gentle learning rate to avoid unduly perturbing the weights found by the pretraining. For all but the last layer, the learning rates for the weights and the biases were very small. To speed learning, a momentum term, a fixed fraction of the previous weight increment, was added to each weight update. For the biases and weights of the 10 output units there was no danger of destroying information from the pretraining, so their learning rates were five times larger and they also had a small penalty proportional to their squared magnitude. After 77 epochs of fine-tuning, the average cross-entropy error on the training data fell below a pre-specified threshold value and the fine-tuning was stopped. The test error at that point was 1.2%. The threshold value was determined by performing the pretraining and fine-tuning on only 50,000 training cases and using the remaining 10,000 training cases as a validation set to determine the training cross-entropy error that gave the fewest classification errors on the validation set.

We also fine-tuned the whole network using the method of conjugate gradients (1) on mini-batches containing 1000 data vectors. Three line searches were performed for each mini-batch in each epoch. After 48 epochs of this fine-tuning, the test error was also 1.2%. The stopping criterion was determined in the same way as described above. The Matlab code for training such a classifier is available at hinton/matlabforsciencepaper.html

Supporting text

How the energies of images determine their probabilities: The probability that the model assigns to a visible vector v is

p(v) = \sum_{h \in H} e^{-E(v,h)} / \sum_{u} \sum_{g \in H} e^{-E(u,g)}    (2)

where H is the set of all possible binary hidden vectors and the outer sum in the denominator runs over all possible visible vectors u. To follow the gradient of log p(v) with respect to the weights, it is necessary to alternate between updating the states of the features and the states of the pixels until the stationary distribution is reached. The expected pixel-feature correlations in this stationary distribution are then used in place of the expectations computed from the one-step confabulations. In addition to being much slower, this maximum likelihood learning procedure suffers from much more sampling noise. To further reduce noise in the learning signal, the binary states of already learned feature detectors (or pixels) in the data layer are replaced by their real-valued probabilities of activation when learning the next layer of feature detectors, but the new feature detectors always have stochastic binary states to limit the amount of information they can convey.

The energy function for real-valued data: When using linear visible units and binary hidden units, the energy function and update rules are particularly simple if the linear units have Gaussian noise with unit variance. If the variances are not all 1, the energy function becomes

E(v, h) = \sum_i (v_i - b_i)^2 / (2 \sigma_i^2) - \sum_j b_j h_j - \sum_{i,j} (v_i / \sigma_i) h_j w_{ij}    (3)

where \sigma_i is the standard deviation of the Gaussian noise for visible unit i. The stochastic update rule for the hidden units remains the same, except that each v_i is divided by \sigma_i. The update rule for visible unit i is to sample from a Gaussian with mean b_i + \sigma_i \sum_j h_j w_{ij} and variance \sigma_i^2.

The differing goals of Restricted Boltzmann Machines and autoencoders: Unlike an autoencoder, the aim of the pretraining algorithm is not to accurately reconstruct each training image but to make the distribution of the confabulations be the same as the distribution of the images. This goal can be satisfied even if an image A that occurs with some probability in the training set is confabulated as a different image B, provided that, summed over all images A, the total probability of confabulating each image B matches the probability of B in the training set.
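For intuition about Eq. (2), the probability of a visible vector can be computed exactly for a very small binary RBM by enumerating every visible and hidden configuration. The following sketch does this with made-up sizes and random weights; it is useful only for checking learning code on toy problems, since the sums become intractable for realistic layer sizes.

% Exact evaluation of Eq. (2) for a tiny binary RBM by brute-force enumeration (sketch).
numvis = 4; numhid = 3;                          % small enough to enumerate every configuration
W = 0.1 * randn(numvis, numhid);                 % pairwise weights w_ij
bvis = zeros(numvis, 1); bhid = zeros(numhid, 1);

% All binary visible vectors (rows) and all binary hidden vectors (rows)
allvis = double(dec2bin(0:2^numvis - 1) - '0');  % 2^numvis x numvis
allhid = double(dec2bin(0:2^numhid - 1) - '0');  % 2^numhid x numhid

% Energy of every (v, h) pair: E(v, h) = -v'*W*h - bvis'*v - bhid'*h
E = -(allvis * W * allhid') - repmat(allvis * bvis, 1, 2^numhid) ...
    - repmat((allhid * bhid)', 2^numvis, 1);

Z = sum(sum(exp(-E)));           % the denominator of Eq. (2): sum over all (v, h) pairs
pvis = sum(exp(-E), 2) / Z;      % p(v) for every visible vector: numerator summed over h in H

v  = [1 0 1 0];                                   % one particular image
pv = pvis(bin2dec(num2str(v, '%d')) + 1);         % its probability under the model

Maximum likelihood learning would follow the gradient of log p(v) computed from such exact (or stationary-distribution) statistics; contrastive divergence replaces them with the one-step confabulation statistics described above.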
Supporting figures

Fig. S1: The average squared reconstruction error per test image during fine-tuning on the curves training data, plotted against the number of fine-tuning epochs for pretrained and randomly initialized autoencoders. Left panel: The deep autoencoder makes rapid progress after pretraining but no progress without pretraining. Right panel: A shallow autoencoder can learn without pretraining, but pretraining makes the fine-tuning much faster, and the pretraining takes less time than 10 iterations of fine-tuning.

Fig. S2: The average squared reconstruction error per image on the test dataset during fine-tuning on the curves dataset. A deep autoencoder performs slightly better than a shallow autoencoder that has about the same number of parameters. Both autoencoders were pretrained.
Fig. S3: An alternative visualization of the 2-D codes produced by taking the first two principal components of all 60,000 training images. 5,000 images of digits (500 per class) are sampled in random order. Each image is displayed if it does not overlap any of the images that have already been displayed.
Fig. S4: An alternative visualization of the 2-D codes produced by an autoencoder with two code units trained on all 60,000 training images. 5,000 images of digits (500 per class) are sampled in random order. Each image is displayed if it does not overlap any of the images that have already been displayed.
Fig. S5: Accuracy curves when a query document from the test set is used to retrieve other test-set documents, averaged over all 7,531 possible queries. The two panels compare the autoencoder, LLE, and LSA on the 20 newsgroups dataset using 10-D and 2-D codes, plotting retrieval accuracy against the number of retrieved documents.

References and Notes

1. For the conjugate gradient fine-tuning, we used Carl Rasmussen's "minimize" code.
2. G. Hinton, V. Nair, Advances in Neural Information Processing Systems (MIT Press, Cambridge, MA, 2006).
3. Matlab code for generating the images of curves is available from G. Hinton's web page.
4. G. E. Hinton, Neural Computation 14, 1711 (2002).
5. S. T. Roweis, L. K. Saul, Science 290, 2323 (2000).
6. The 20 newsgroups dataset used is the version called 20news-bydate.tar.gz.
7. L. K. Saul, S. T. Roweis, Journal of Machine Learning Research 4, 119 (2003).
8. Matlab code for LLE is available at roweis/lle/index.html.
9. Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Proceedings of the IEEE 86, 2278 (1998).
10. D. Decoste, B. Schoelkopf, Machine Learning 46, 161 (2002).
11. P. Y. Simard, D. Steinkraus, J. C. Platt, Proceedings of the Seventh International Conference on Document Analysis and Recognition (2003).