CSC321: Introduction to Neural Networks and Machine Learning. Lecture 21: Using Boltzmann machines to initialize backpropagation. Geoffrey Hinton.

Some problems with backpropagation. The amount of information that each training case provides about the weights is at most the log of the number of possible output labels, so to train a big net we need lots of labeled data. In nets with many layers of weights, the backpropagated derivatives either grow or shrink multiplicatively at each layer; learning is tricky either way. Dumb gradient descent is not a good way to perform a global search for a good region of a very large, very nonlinear space. So deep nets trained by backpropagation are rare in practice.
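
A minimal numpy sketch of the multiplicative effect (illustrative, not from the lecture): pushing an error signal back through many layers of logistic units repeatedly multiplies it by a weight matrix and by the logistic derivative, so its size tends to shrink towards zero or blow up exponentially with depth. The layer width and weight scales below are arbitrary choices.

import numpy as np

rng = np.random.RandomState(0)

def backprop_signal_norm(depth, weight_scale, width=100):
    # Push a random error signal backwards through `depth` layers of logistic
    # units and report how its norm changes (illustrative only).
    delta = rng.randn(width)
    for _ in range(depth):
        W = weight_scale * rng.randn(width, width) / np.sqrt(width)
        h = 1.0 / (1.0 + np.exp(-rng.randn(width)))   # activities of a hidden layer
        delta = (W.T @ delta) * h * (1.0 - h)         # chain rule: weights times logistic derivative
    return np.linalg.norm(delta)

for scale in (0.5, 1.0, 10.0):
    print(scale, [round(backprop_signal_norm(d, scale), 6) for d in (1, 5, 10, 20)])
# Small weights: the signal shrinks towards zero. Big weights: it explodes.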

A solution to all of these problems. Use greedy unsupervised learning to find a sensible set of weights one layer at a time. Then fine-tune with backpropagation. Greedily learning one layer at a time scales well to really deep networks. Most of the information in the final weights comes from modeling the distribution of input vectors. The precious information in the labels is only used for the final fine-tuning. We do not start backpropagation until we already have sensible weights that do well at the task, so the fine-tuning is well-behaved and quite fast.
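
A schematic numpy sketch of the greedy recipe, assuming binary logistic units trained with one step of contrastive divergence (CD-1); the layer sizes, learning rate, and helper names are illustrative rather than the lecture's code.

import numpy as np

rng = np.random.RandomState(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=10, lr=0.1):
    # Train one RBM with CD-1 on `data` (rows are training cases in [0, 1]).
    n_visible = data.shape[1]
    W = 0.01 * rng.randn(n_visible, n_hidden)
    a = np.zeros(n_visible)            # visible biases
    b = np.zeros(n_hidden)             # hidden biases
    for _ in range(epochs):
        v0 = data
        p_h0 = sigmoid(v0 @ W + b)
        h0 = (p_h0 > rng.rand(*p_h0.shape)).astype(float)
        v1 = sigmoid(h0 @ W.T + a)     # one reconstruction step
        p_h1 = sigmoid(v1 @ W + b)
        W += lr * (v0.T @ p_h0 - v1.T @ p_h1) / len(data)
        a += lr * (v0 - v1).mean(axis=0)
        b += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b

# Greedy layer-wise pre-training: each RBM models the hidden activities
# produced by the one below it.
layer_sizes = [500, 500, 2000]         # as in the digit model later in the lecture
x = rng.rand(100, 784)                 # stand-in for binarized 28x28 images
weights = []
for n_hidden in layer_sizes:
    W, b = train_rbm(x, n_hidden)
    weights.append((W, b))
    x = sigmoid(x @ W + b)             # pass the data up to train the next layer
# `weights` now initializes a deep net; fine-tune with backprop using the labels.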

Modelling the distribution of digit images. The top two layers form a restricted Boltzmann machine whose free energy landscape should model the low-dimensional manifolds of the digits. The network learns a density model for unlabeled digit images. When we generate from the model we often get things that look like real digits of all classes. More hidden layers make the generated fantasies look better (Y. W. Teh, Simon Osindero). But do the hidden features really help with digit discrimination? Add 10 softmaxed units to the top and do backprop. (Architecture, bottom to top: a 28 x 28 pixel image, 500 units, 500 units, 2000 units.)
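
A hedged sketch of what "generating from the model" means, reusing the stack from the previous sketch: run alternating Gibbs sampling in the top-level RBM for a while, then make one top-down pass through the lower layers using their transposed weights. Names are illustrative and generative biases are omitted for brevity.

import numpy as np

rng = np.random.RandomState(1)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def generate_fantasy(weights, gibbs_steps=200):
    # weights: list of (W, hidden_bias) pairs from the greedy pre-training sketch.
    W_top, b_top = weights[-1]
    v = (rng.rand(W_top.shape[0]) > 0.5).astype(float)   # random start for the top RBM's visible layer
    for _ in range(gibbs_steps):                          # alternating Gibbs sampling in the top RBM
        h = (sigmoid(v @ W_top + b_top) > rng.rand(W_top.shape[1])).astype(float)
        v = (sigmoid(h @ W_top.T) > rng.rand(W_top.shape[0])).astype(float)
    activity = v
    for W, _ in reversed(weights[:-1]):                   # single top-down pass through the lower layers
        activity = sigmoid(activity @ W.T)
    return activity                                        # 784 values that can be shown as a 28x28 "fantasy"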

Results on the permutation-invariant MNIST task:
Very carefully trained backprop net with one or two hidden layers (Platt; Hinton): 1.53%
SVM (Decoste & Schoelkopf): 1.4%
Generative model of joint density of images and labels (with unsupervised fine-tuning): 1.25%
Generative model of unlabelled digits followed by gentle backpropagation: 1.2%
Generative model of joint density followed by gentle backpropagation: 1.1%

Learning dynamics of deep nets. The next four slides describe work by Yoshua Bengio's group. (Figures: before fine-tuning and after fine-tuning.)

Effect of unsupervised pre-training. (Figure from Erhan et al., AISTATS 2009.)

Effect of depth. (Figure: without pre-training vs. with pre-training.)

Why unsupervised pre-training makes sense. (Diagrams: in the first, hidden "stuff" generates the image and the label is then computed directly from the image; in the second, the "stuff" generates the image through a high-bandwidth pathway and the label through a low-bandwidth pathway.) If image-label pairs were generated in the first way, it would make sense to try to go straight from images to labels; for example, do the pixels have even parity? If image-label pairs are generated in the second way, it makes sense to first learn to recover the stuff that caused the image by inverting the high-bandwidth pathway.

An early use of neural nets (~1989). Use a feedforward neural net to convert a window of speech coefficients into a posterior probability distribution over short pieces of phonemes (61 phones, each with 3 pieces). To train this net we need to know the correct label for each window, so we need to bootstrap from an existing speech recognition system. The trained neural net produces a posterior distribution over phone pieces at each time step. We feed these distributions to a decoder, which finds the most likely sequence of phonemes.

How to make the phone recognizer work much better. Train lots of big layers, one at a time, without using the labels. Add a 183-way softmax over labels as the final layer. Fine-tune with backpropagation on a big GPU board for several days.

A very deep belief net for phone recognition (Mohamed, Dahl & Hinton, 2011). Architecture, bottom to top: 11 frames of filterbank coefficients, four layers of 2000 binary hidden units each, and 183 labels that are not pre-trained. Many of the major speech recognition groups (Google, Microsoft, IBM) are now trying this approach.
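
A minimal forward-pass sketch of this architecture; the 40 filterbank coefficients per frame and the random weights are assumptions (in the real system the hidden weights come from layer-wise pre-training, and the whole net is then fine-tuned with backpropagation).

import numpy as np

rng = np.random.RandomState(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n_input = 11 * 40                      # 11 frames; 40 coefficients per frame is an assumption
layer_sizes = [n_input, 2000, 2000, 2000, 2000]
hidden_weights = [0.01 * rng.randn(m, n) for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
W_out = 0.01 * rng.randn(2000, 183)    # 183 = 61 phones x 3 pieces; not pre-trained

def phone_posteriors(frames):
    # frames: (n_windows, 440) array of stacked filterbank windows.
    # Returns per-window posteriors over the 183 phone pieces, which would
    # then be passed to a decoder.
    h = frames
    for W in hidden_weights:
        h = sigmoid(h @ W)             # biases omitted for brevity
    return softmax(h @ W_out)

posteriors = phone_posteriors(rng.randn(5, n_input))
print(posteriors.shape, posteriors.sum(axis=1))   # (5, 183); each row sums to 1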

Deep autoencoders. They always looked like a really nice way to do non-linear dimensionality reduction: they provide mappings both ways, the learning time is linear (or better) in the number of training cases, and the final model is compact and fast. But it turned out to be very difficult to optimize deep autoencoders using backprop. We now have a much better way to optimize them.

The deep autoencoder. (Diagram: an encoder 784 → 1000 → 500 → 250 → 30 with weight matrices W1, W2, W3, W4, and a decoder that maps the 30 linear code units back up through 250, 500, and 1000 units to 784 using the transposed weights W4^T, W3^T, W2^T, W1^T.) If you start with small random weights it will not learn. If you break symmetry randomly by using bigger weights, it will not find a good solution. So we train a stack of 4 RBMs and then unroll them. Then we fine-tune with gentle backprop.
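
A sketch of the unrolling step, under the same assumptions as the earlier pre-training sketch: the encoder uses the stacked RBM weights W1..W4 and the decoder uses their transposes; biases and the fine-tuning loop are omitted, and the function names are illustrative.

import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def unroll_autoencoder(rbm_weights):
    # rbm_weights: [W1, W2, W3, W4] from pre-training a 784-1000-500-250-30 stack.
    def encode(x):
        h = x
        for W in rbm_weights[:-1]:
            h = sigmoid(h @ W)
        return h @ rbm_weights[-1]              # 30 linear code units
    def decode(code):
        h = code
        for W in reversed(rbm_weights):
            h = sigmoid(h @ W.T)                # logistic units on the way back down
        return h                                 # reconstructed 784-pixel image
    return encode, decode

# Fine-tuning then adjusts all the weights with gentle backprop on the
# reconstruction error, e.g. sum((decode(encode(x)) - x) ** 2).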

A comparison of methods for compressing digit images to 30 real numbers. (Figure rows: real data, 30-D deep autoencoder, 30-D logistic PCA, 30-D PCA.)

A very deep autoencoder for synthetic curves that only have 6 degrees of freedom. Squared reconstruction error: Data 0.0; Auto:6 1.5; PCA:6 10.3; PCA:30 3.9.

An autoencoder for patches of real faces: 625 → 2000 → 1000 → 641 → 30 and back out again (linear input and code units, logistic hidden units). Train on 100,000 denormalized face patches from 300 images of 30 people. Use 100 epochs of CD at each layer, followed by backprop through the unfolded autoencoder. Test on face patches from 100 images of 10 new people.

Reconstructions of face patches from new people. (Figure rows: data, Auto:30, PCA:30.) Squared errors: Auto:30 126; PCA:30 135.

64 of the hidden units in the first hidden layer

How to find documents that are similar to a query document. Convert each document into a bag of words: a vector of word counts that ignores the order of the words. Ignore stop words (like "the" or "over"). We could compare the word counts of the query document and millions of other documents, but this is too slow. So we reduce each query vector to a much smaller vector that still contains most of the information about the content of the document. (Example count vector: fish 0, cheese 0, vector 2, count 2, school 0, query 2, reduce 1, bag 1, pulpit 0, iraq 0, word 2.)
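
A small sketch of building such a count vector over a fixed vocabulary; the stop-word list and vocabulary below are just the toy ones suggested by the slide.

from collections import Counter

STOP_WORDS = {"the", "or", "over", "a", "and", "of", "in"}     # toy stop-word list
VOCABULARY = ["fish", "cheese", "vector", "count", "school",
              "query", "reduce", "bag", "pulpit", "iraq", "word"]

def bag_of_words(text, vocab=VOCABULARY):
    # Count how often each vocabulary word occurs, ignoring order and stop words.
    counts = Counter(w for w in text.lower().split() if w not in STOP_WORDS)
    return [counts[w] for w in vocab]

print(bag_of_words("reduce the query vector and count each word in the bag"))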

How to compress the count vector. (Diagram: an input vector of 2000 word counts feeds 500 neurons, then 250 neurons, then a central bottleneck of 10 neurons, then 250 neurons, then 500 neurons, then an output vector of 2000 reconstructed counts.) We train the neural network to reproduce its input vector as its output. This forces it to compress as much information as possible into the 10 numbers in the central bottleneck. These 10 numbers are then a good way to compare documents.

The non-linearity used for reconstructing bags of words. Divide the counts in a bag-of-words vector by N, where N is the total number of non-stop words in the document. The resulting probability vector gives the probability of getting a particular word if we pick a non-stop word at random from the document. At the output of the autoencoder, we use a softmax, and the probability vector defines the desired outputs of the softmax. When we train the first RBM in the stack we use the same trick: we treat the word counts as probabilities, but we make the visible-to-hidden weights N times bigger than the hidden-to-visible weights because we have N observations from the probability distribution.
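
A hedged numpy sketch of that trick: the counts become a probability vector, the reconstruction is a softmax over the vocabulary, and the up-going input to the hidden units is scaled by N. Weight shapes and values are illustrative.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

counts = np.array([0, 0, 2, 2, 0, 2, 1, 1, 0, 0, 2], dtype=float)   # example count vector from above
N = counts.sum()                       # number of non-stop words in the document
target_probs = counts / N              # desired outputs of the softmax at the autoencoder's output

rng = np.random.RandomState(0)
W = 0.01 * rng.randn(len(counts), 10)  # illustrative visible-to-hidden weights of the first RBM
b_hidden = np.zeros(10)

# Up pass: treat the counts as probabilities but scale the visible-to-hidden
# input by N (equivalently, use the raw counts), because the document gives
# N observations from this distribution.
p_hidden = 1.0 / (1.0 + np.exp(-(N * (target_probs @ W) + b_hidden)))

# Down pass: a softmax over the vocabulary defines the reconstruction distribution.
reconstruction = softmax(p_hidden @ W.T)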

Performance of the autoencoder at document retrieval. Train on bags of 2000 words for 400,000 training cases of business documents. First train a stack of RBMs, then fine-tune with backprop. Test on a separate 400,000 documents: pick one test document as a query and rank order all the other test documents by using the cosine of the angle between their codes. Repeat this using each of the 400,000 test documents as the query (which requires 0.16 trillion comparisons). Plot the number of retrieved documents against the proportion that are in the same hand-labeled class as the query document. Compare with LSA (a version of PCA).
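
A minimal sketch of the ranking step: every document has been reduced to a short code, and the other documents are ranked by the cosine of the angle between their code and the query's code. The random codes and names here are stand-ins.

import numpy as np

rng = np.random.RandomState(0)
codes = rng.randn(1000, 10)                         # stand-in for the 10-D document codes

def retrieve(query_index, codes, k=10):
    # Rank documents by cosine similarity to the query document's code.
    normed = codes / np.linalg.norm(codes, axis=1, keepdims=True)
    sims = normed @ normed[query_index]
    sims[query_index] = -np.inf                     # never retrieve the query itself
    return np.argsort(-sims)[:k]

print(retrieve(0, codes))
# To score retrieval quality, count how many of the retrieved documents are in
# the same hand-labeled class as the query, as in the plot on the next slide.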

(Figure: proportion of retrieved documents in the same class as the query, plotted against the number of documents retrieved.)

First compress all documents to 2 numbers using a type of PCA. Then use different colors for different document categories.

Now compress all documents to 2 numbers with the deep autoencoder instead, and again use different colors for different document categories.

THE END