CVPR 2012 Tutorial: Deep Learning Methods for Vision (draft). Honglak Lee, Computer Science and Engineering Division, University of Michigan, Ann Arbor
Feature representations. [Figure: motorbikes vs. non-motorbikes plotted in the raw input space, with axes "pixel 1" and "pixel 2", fed to a learning algorithm.]
Feature representations. [Figure: the same data mapped from the input space (axes "pixel 1", "pixel 2") to a feature space (axes "handle", "wheel"), making motorbikes vs. non-motorbikes easier for the learning algorithm to separate.]
How is computer perception done? State-of-the-art: hand-crafting. Pipeline: input data, feature representation, learning algorithm. Object detection: image, low-level vision features (SIFT, HOG, etc.), object detection / classification. Audio classification: audio, low-level audio features (spectrogram, MFCC, etc.), speaker identification.
Computer vision features SIFT Spin image HoG RIFT Textons GLOH 5
Computer vision features: SIFT, Spin image, HoG, RIFT, Textons, GLOH. Hand-crafted features: 1. Need expert knowledge. 2. Require time-consuming hand-tuning. 3. (Arguably) one of the limiting factors of computer vision systems.
Learning Feature Representations. Key idea: learn the statistical structure or correlation of the data from unlabeled data. The learned representations can be used as features in supervised and semi-supervised settings. Known as: unsupervised feature learning, feature learning, deep learning, representation learning, etc. Topics covered in this talk: Restricted Boltzmann Machines, Deep Belief Networks, Denoising Autoencoders. Applications: vision, audio, and multimodal learning.
Learning Feature Hierarchy / Deep Learning. Deep architectures can be representationally efficient. Natural progression from low-level to high-level structures. Can share the lower-level representations for multiple tasks. Hierarchy: input, 1st layer (edges), 2nd layer (object parts), 3rd layer (objects).
Outline: Restricted Boltzmann machines; Deep Belief Networks; Denoising Autoencoders; Applications to Vision; Applications to Audio and Multimodal Data
Learning Feature Hierarchy [Related work: Hinton, Bengio, LeCun, Ng, and others.] Higher layer: DBNs (combinations of edges). First layer: RBMs (edges). Input: image patch (pixels).
Restricted Boltzmann Machines with binary-valued input data. Representation: undirected bipartite graphical model between a hidden layer (H, units h_j) and a visible layer (V, units v_i). v ∈ {0,1}^D: observed (visible) binary variables. h ∈ {0,1}^K: hidden binary variables.
Conditional Probabilities (RBM with binary-valued input data). Given v, all the h_j are conditionally independent:
P(h_j = 1 | v) = exp(Σ_i W_ij v_i + b_j) / (exp(Σ_i W_ij v_i + b_j) + 1) = sigmoid(Σ_i W_ij v_i + b_j) = sigmoid(w_j^T v + b_j)
P(h | v) can be used as features. Given h, all the v_i are conditionally independent:
P(v_i = 1 | h) = sigmoid(Σ_j W_ij h_j + c_i)
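As a concrete sketch of the two conditionals above (toy layer sizes and random weights are illustrative, not from the tutorial):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, W, b):
    # P(h_j = 1 | v) = sigmoid(w_j^T v + b_j); all h_j are conditionally independent
    return sigmoid(v @ W + b)

def p_v_given_h(h, W, c):
    # P(v_i = 1 | h) = sigmoid(sum_j W_ij h_j + c_i)
    return sigmoid(h @ W.T + c)

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(6, 4))   # 6 visible, 4 hidden units (toy sizes)
b, c = np.zeros(4), np.zeros(6)
v = rng.integers(0, 2, size=6).astype(float)
h_probs = p_h_given_v(v, W, b)           # these posteriors can be used as features
```

Because the model is bipartite, each conditional is a single matrix-vector product followed by an elementwise sigmoid.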
Restricted Boltzmann Machines with real-valued input data. Representation: undirected bipartite graphical model between a hidden layer (H) and a visible layer (V). V: observed (visible) real variables. H: hidden binary variables.
Conditional Probabilities (RBM with real-valued input data). Given v, all the h_j are conditionally independent:
P(h_j = 1 | v) = exp((1/σ) Σ_i W_ij v_i + b_j) / (exp((1/σ) Σ_i W_ij v_i + b_j) + 1) = sigmoid((1/σ) Σ_i W_ij v_i + b_j) = sigmoid((1/σ) w_j^T v + b_j)
P(h | v) can be used as features. Given h, all the v_i are conditionally independent:
P(v_i | h) = N(σ Σ_j W_ij h_j + c_i, σ²), or equivalently P(v | h) = N(σ W h + c, σ² I).
Inference. Conditional distribution: P(v | h) or P(h | v). Easy to compute (see previous slides). Due to conditional independence, we can sample all hidden units in parallel given all visible units (and vice versa). Joint distribution: P(v, h). Requires Gibbs sampling (approximate; lots of iterations to converge):
Initialize with v^0
Sample h^0 from P(h | v^0)
Repeat until convergence (t = 1, ...) {
  Sample v^t from P(v | h^{t-1})
  Sample h^t from P(h | v^t)
}
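The alternating Gibbs loop above can be sketched in a few lines of NumPy (toy sizes and random weights are hypothetical; each step samples a whole layer in parallel thanks to the conditional independence):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_sample(v0, W, b, c, n_steps, rng):
    """Block Gibbs sampling in a binary RBM: alternately sample
    h | v and v | h, each layer in parallel."""
    v = v0
    h = (rng.random(W.shape[1]) < sigmoid(v @ W + b)).astype(float)
    for _ in range(n_steps):
        v = (rng.random(W.shape[0]) < sigmoid(h @ W.T + c)).astype(float)
        h = (rng.random(W.shape[1]) < sigmoid(v @ W + b)).astype(float)
    return v, h

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(6, 4))          # 6 visible, 4 hidden (toy sizes)
v0 = rng.integers(0, 2, size=6).astype(float)
v, h = gibbs_sample(v0, W, np.zeros(4), np.zeros(6), n_steps=100, rng=rng)
```

With many steps the pair (v, h) approaches a sample from the joint P(v, h); contrastive divergence (next slides) truncates exactly this chain.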
Training RBMs. How can we find parameters θ that maximize P_θ(v)? The gradient contrasts the data distribution (posterior of h given v) with the model distribution. We need to compute P(h | v) and P(v, h), and the derivative of E with respect to the parameters {W, b, c}. P(h | v): tractable. P(v, h): intractable; can be approximated with Gibbs sampling, but that requires lots of iterations.
Contrastive Divergence. An approximation of the log-likelihood gradient for RBMs: 1. Replace the average over all possible inputs by samples. 2. Run the MCMC chain (Gibbs sampling) for only k steps, starting from the observed example:
Initialize with v^0 = v
Sample h^0 from P(h | v^0)
For t = 1, ..., k {
  Sample v^t from P(v | h^{t-1})
  Sample h^t from P(h | v^t)
}
A picture of the maximum likelihood learning algorithm for an RBM. [Figure: alternating Gibbs updates between the v_i and h_j at t = 0, 1, 2, ..., ∞; as t → ∞ the chain reaches the equilibrium distribution.] Slide credit: Geoff Hinton
A quick way to learn an RBM. [Figure: a single up-down-up pass between v and h: t = 0 (data) and t = 1 (reconstruction).] Start with a training vector on the visible units. Update all the hidden units in parallel. Update all the visible units in parallel to get a reconstruction. Update the hidden units again. Slide credit: Geoff Hinton
Update Rule: putting it together. Training via stochastic gradient. Note that ∂E/∂W_ij = −v_i h_j. Therefore, ΔW_ij ∝ ⟨v_i h_j⟩_data − ⟨v_i h_j⟩_reconstruction. Similar update rules can be derived for the biases b and c. Implemented in ~10 lines of Matlab code.
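The "~10 lines of Matlab" translate almost directly to NumPy. A minimal CD-1 sketch (toy sizes, mean-field reconstruction, all parameters hypothetical):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(V, W, b, c, lr, rng):
    """One CD-1 update on a batch V of binary rows:
    dW_ij is proportional to <v_i h_j>_data - <v_i h_j>_reconstruction."""
    ph0 = sigmoid(V @ W + b)                       # positive phase P(h|v_data)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    V1 = sigmoid(h0 @ W.T + c)                     # reconstruction (mean-field)
    ph1 = sigmoid(V1 @ W + b)                      # negative phase
    n = V.shape[0]
    W += lr * (V.T @ ph0 - V1.T @ ph1) / n
    b += lr * (ph0 - ph1).mean(axis=0)
    c += lr * (V - V1).mean(axis=0)
    return W, b, c

rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(6, 4))
b, c = np.zeros(4), np.zeros(6)
V = rng.integers(0, 2, size=(20, 6)).astype(float)
for _ in range(50):
    W, b, c = cd1_update(V, W, b, c, lr=0.1, rng=rng)
```

Using the hidden probabilities (rather than binary samples) in the outer products is a common variance-reduction choice; sampling h0 before reconstruction keeps the negative phase stochastic.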
Other ways of training RBMs. Persistent CD [Tieleman, ICML 2008; Tieleman & Hinton, ICML 2009]: keep a background MCMC chain to obtain the negative-phase samples. Related to stochastic approximation [Robbins and Monro, Ann. Math. Stat., 1951; L. Younes, Probability Theory, 1989]. Score matching [Swersky et al., ICML 2011; Hyvarinen, JMLR 2005]: use the score function to eliminate Z; match the model's and the empirical score functions.
Estimating the Log-Likelihood of RBMs. How do we measure the likelihood of the learned models? For an RBM this requires estimating the partition function Z. Reconstruction error provides a cheap proxy. log Z is tractable analytically for fewer than 25 binary input or hidden units; it can be approximated with Annealed Importance Sampling (AIS) [Salakhutdinov & Murray, ICML 2008]. Open question: efficient ways to monitor progress.
Variants of RBMs
Sparse RBM / DBN [Lee et al., NIPS 2008]. Main idea: constrain the hidden-layer nodes to have sparse average values (activations) [cf. sparse coding]. Optimization: trade off between the log-likelihood and a sparsity penalty that pushes each unit's average activation toward a target sparsity; other penalty functions (e.g., KL divergence) can also be used.
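One concrete form of the penalty is a squared error between the target sparsity and each unit's average activation over a batch; this sketch (toy sizes, hypothetical weights, squared-error penalty as one of the admissible choices) just evaluates it:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sparsity_penalty(V, W, b, target=0.05):
    """Penalty = sum_j (target - mean_batch P(h_j = 1 | v))^2,
    traded off against the log-likelihood during training."""
    mean_act = sigmoid(V @ W + b).mean(axis=0)   # average activation per hidden unit
    return np.sum((target - mean_act) ** 2)

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(6, 4))
V = rng.integers(0, 2, size=(20, 6)).astype(float)
pen = sparsity_penalty(V, W, np.zeros(4), target=0.05)
```

During learning, the gradient of this penalty with respect to {W, b} is simply added (with a weight λ) to the contrastive-divergence update.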
Modeling handwritten digits. Sparse dictionary learning via sparse RBMs: first-layer bases W^1 ("pen strokes") learned over the input nodes (data). The learned sparse representations can be used as features. [Lee et al., NIPS 2008; Ranzato et al., NIPS 2007]
3-way factorized RBM [Ranzato et al., AISTATS 2010; Ranzato and Hinton, CVPR 2010]. Models the covariance structure of images using hidden variables: the 3-way factorized RBM / mean-covariance RBM. [Figure: Gaussian MRF vs. 3-way (covariance) RBM, with factors F connecting pixel pairs x_p, x_q through hidden units h_k.] Slide credit: Marc'Aurelio Ranzato
Generating natural image patches. [Figure: patches from natural images, from the mcRBM (Ranzato and Hinton, CVPR 2010), from a GRBM, and from an S-RBM + DBN (both from Osindero and Hinton, NIPS 2008).] Slide credit: Marc'Aurelio Ranzato
Outline: Restricted Boltzmann machines; Deep Belief Networks; Denoising Autoencoders; Applications to Vision; Applications to Audio and Multimodal Data
Deep Belief Networks (DBNs) [Hinton et al., 2006]. A probabilistic generative model with a deep architecture (multiple layers). Unsupervised pre-training provides a good initialization of the network, maximizing a lower bound on the log-likelihood of the data. Supervised fine-tuning: generative (up-down algorithm) or discriminative (backpropagation).
DBN structure [Hinton et al., 2006]. Visible layer v and hidden layers h^1, h^2, h^3. The top two layers form an RBM; the layers below form directed belief nets:
P(v, h^1, ..., h^l) = P(v | h^1) P(h^1 | h^2) ... P(h^{l-2} | h^{l-1}) P(h^{l-1}, h^l)
DBN structure [Hinton et al., 2006]. Generative process (top-down, weights W^1, W^2, W^3): P(h^2, h^3) at the top, then P(h^1 | h^2) and P(v | h^1), with P(h_i^{k-1} = 1 | h^k) = sigm(b_i + Σ_j (W^k)'_ij h_j^k). Approximate inference (bottom-up): Q(h^1 | v) and Q(h^2 | h^1), with Q(h_j^k = 1 | h^{k-1}) = sigm(b_j + Σ_i W_ij^k h_i^{k-1}); at the top, Q(h^3 | h^2) = P(h^3 | h^2). Joint distribution:
P(v, h^1, ..., h^l) = P(v | h^1) P(h^1 | h^2) ... P(h^{l-2} | h^{l-1}) P(h^{l-1}, h^l)
DBN greedy training, first step: construct an RBM with an input layer v and a hidden layer h; train the RBM. [Hinton et al., 2006]
DBN greedy training, second step: stack another hidden layer on top of the RBM to form a new RBM. Fix W^1, sample h^1 from Q(h^1 | v) as input, and train W^2 as an RBM. [Hinton et al., 2006]
DBN greedy training, third step: continue to stack layers on top of the network, training each as in the previous step, with h^2 sampled from Q(h^2 | h^1). And so on. [Hinton et al., 2006]
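The stacking procedure in the three steps above can be sketched end-to-end (toy sizes, a CD-1 trainer, and mean-field propagation of Q(h | v); all hypothetical choices, not the tutorial's exact recipe):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(V, n_hidden, rng, n_iters=50, lr=0.1):
    """Toy CD-1 trainer for one RBM layer."""
    W = rng.normal(scale=0.01, size=(V.shape[1], n_hidden))
    b, c = np.zeros(n_hidden), np.zeros(V.shape[1])
    for _ in range(n_iters):
        ph0 = sigmoid(V @ W + b)
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        V1 = sigmoid(h0 @ W.T + c)
        ph1 = sigmoid(V1 @ W + b)
        W += lr * (V.T @ ph0 - V1.T @ ph1) / V.shape[0]
        b += lr * (ph0 - ph1).mean(axis=0)
        c += lr * (V - V1).mean(axis=0)
    return W, b

def greedy_train_dbn(V, layer_sizes, rng):
    """Greedy layer-wise training: train an RBM, freeze it, feed
    Q(h | v) upward as data for the next RBM, and repeat."""
    layers, data = [], V
    for n_hidden in layer_sizes:
        W, b = train_rbm(data, n_hidden, rng)
        layers.append((W, b))
        data = sigmoid(data @ W + b)   # mean of Q(h | v); sampling also works
    return layers

rng = np.random.default_rng(0)
V = rng.integers(0, 2, size=(30, 8)).astype(float)
dbn = greedy_train_dbn(V, layer_sizes=[6, 4], rng=rng)
```

Each call to `train_rbm` only ever sees the layer below it, which is exactly what makes the procedure greedy.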
Why does greedy training work? An RBM specifies P(v, h) through P(v | h) and P(h | v), and implicitly defines P(v) and P(h). Key idea of stacking: keep P(v | h) from the 1st RBM, and replace P(h) by the distribution generated by the 2nd-level RBM. [Hinton et al., 2006]
Why does greedy training work? Greedy training: a variational lower bound justifies greedy layer-wise training of RBMs. [Figure: Q(h | v) with weights W^1; the prior over h is trained by the second-layer RBM.] [Hinton et al., 2006]
Why does greedy training work? Greedy training: a variational lower bound justifies greedy layer-wise training of RBMs. Note: an RBM and a 2-layer DBN are equivalent when W^2 = (W^1)^T. Therefore, the lower bound is tight at initialization, and the log-likelihood improves under greedy training. [Hinton et al., 2006]
DBN and supervised fine-tuning. Discriminative fine-tuning: initialize a neural net with the DBN weights and run backpropagation; maximizes log P(Y | X) (X: data, Y: label). Generative fine-tuning: up-down algorithm; maximizes log P(Y, X) (the joint likelihood of data and labels).
A model for digit recognition. The top two layers form an associative memory whose energy landscape models the low-dimensional manifolds of the digits. The energy valleys have names. Architecture: 28 x 28 pixel image, 500 neurons, 500 neurons, 2000 top-level neurons, with 10 label neurons attached at the top. The model learns to generate combinations of labels and images. To perform recognition, we start with a neutral state of the label units and do an up-pass from the image, followed by a few iterations of the top-level associative memory. http://www.cs.toronto.edu/~hinton/nipstutorial/nipstut3.ppt Slide credit: Geoff Hinton
Fine-tuning with a contrastive version of the wake-sleep algorithm. After learning many layers of features, we can fine-tune the features to improve generation. 1. Do a stochastic bottom-up pass; adjust the top-down weights to be good at reconstructing the feature activities in the layer below. 2. Do a few iterations of sampling in the top-level RBM; adjust the weights in the top-level RBM. 3. Do a stochastic top-down pass; adjust the bottom-up weights to be good at reconstructing the feature activities in the layer above. http://www.cs.toronto.edu/~hinton/nipstutorial/nipstut3.ppt Slide credit: Geoff Hinton
Generating a sample from a DBN. We want to sample from P(v, h^1, ..., h^l) = P(v | h^1) P(h^1 | h^2) ... P(h^{l-2} | h^{l-1}) P(h^{l-1}, h^l). Sample h^{l-1} using Gibbs sampling in the top-level RBM, then sample each lower layer from P(h^{k-1} | h^k) in a single top-down pass.
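The two-stage procedure above (Gibbs in the top RBM, then ancestral top-down sampling) can be sketched as follows; layer sizes, weights, and the number of Gibbs steps are all hypothetical toy choices:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_from_dbn(top_rbm, down_layers, n_gibbs, rng):
    """top_rbm: (W, b, c) of the RBM between h^{l-1} and h^l.
    down_layers: list of (W, bias) for the directed layers, ordered
    top-down, with P(lower_i = 1 | upper) = sigmoid(W @ upper + bias)_i."""
    W, b, c = top_rbm
    h_pen = (rng.random(W.shape[0]) < 0.5).astype(float)   # h^{l-1}, random start
    for _ in range(n_gibbs):                               # Gibbs in the top RBM
        h_top = (rng.random(W.shape[1]) < sigmoid(h_pen @ W + b)).astype(float)
        h_pen = (rng.random(W.shape[0]) < sigmoid(h_top @ W.T + c)).astype(float)
    x = h_pen
    for Wd, bias in down_layers:                           # single top-down pass
        x = (rng.random(Wd.shape[0]) < sigmoid(Wd @ x + bias)).astype(float)
    return x

rng = np.random.default_rng(0)
top_rbm = (rng.normal(scale=0.1, size=(4, 3)), np.zeros(3), np.zeros(4))
down_layers = [(rng.normal(scale=0.1, size=(6, 4)), np.zeros(6)),
               (rng.normal(scale=0.1, size=(8, 6)), np.zeros(8))]
v = sample_from_dbn(top_rbm, down_layers, n_gibbs=50, rng=rng)
```

Only the top pair of layers needs an MCMC loop; everything below is a one-shot directed pass, which is what makes DBN sampling cheap compared to a deep Boltzmann machine.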
Generating samples from a DBN. [Hinton et al., "A Fast Learning Algorithm for Deep Belief Nets", 2006]
Results for supervised fine-tuning on MNIST:
Very carefully trained backprop net with one or two hidden layers (Platt; Hinton): 1.6%
SVM (Decoste & Schoelkopf, 2002): 1.4%
Generative model of the joint density of images and labels (+ generative fine-tuning): 1.25%
Generative model of unlabelled digits followed by gentle backpropagation (Hinton & Salakhutdinov, Science 2006): 1.15%
http://www.cs.toronto.edu/~hinton/nipstutorial/nipstut3.ppt Slide credit: Geoff Hinton
More details on the up-down algorithm: Hinton, G. E., Osindero, S., and Teh, Y. (2006). "A fast learning algorithm for deep belief nets", Neural Computation, 18, pp. 1527-1554. http://www.cs.toronto.edu/~hinton/absps/ncfast.pdf Handwritten digit demo: http://www.cs.toronto.edu/~hinton/digits.html
Outline: Restricted Boltzmann machines; Deep Belief Networks; Denoising Autoencoders; Applications to Vision; Applications to Audio and Multimodal Data
Denoising Autoencoder [Vincent et al., ICML 2008]. Perturbs the input x to a corrupted version x̃: e.g., randomly sets some of the coordinates of the input to zeros. Recovers x from the encoding of the perturbed data: minimize the loss between x and the reconstruction x̂.
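A minimal sketch of one corrupt-encode-decode pass (toy sizes, random weights, and tied encoder/decoder weights are all assumptions for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def corrupt(x, p, rng):
    """Masking noise: randomly zero a fraction p of the coordinates."""
    return x * (rng.random(x.shape) >= p)

def dae_forward(x_tilde, W, b, c):
    """Encode the corrupted input, then decode a reconstruction of x."""
    h = sigmoid(x_tilde @ W + b)       # encoder
    x_hat = sigmoid(h @ W.T + c)       # decoder (tied weights, an assumption)
    return x_hat

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(8, 5))
x = rng.integers(0, 2, size=8).astype(float)
x_tilde = corrupt(x, p=0.3, rng=rng)
x_hat = dae_forward(x_tilde, W, np.zeros(5), np.zeros(8))
loss = np.mean((x - x_hat) ** 2)       # loss between x and the reconstruction
```

The key point is that the loss is measured against the clean x, not the corrupted x̃, so the model must learn structure that lets it undo the corruption.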
Denoising Autoencoder [Vincent et al., ICML 2008]. Learns a vector field from corrupted inputs toward higher-probability regions. Minimizes a variational lower bound on a generative model. Corresponds to regularized score matching on an RBM. Slide credit: Yoshua Bengio
Stacked Denoising Autoencoders. Greedy layer-wise learning: start with the lowest level and stack upwards, training each layer of the autoencoder on the intermediate code (features) from the layer below. The top layer can have a different output (e.g., a softmax nonlinearity) to provide an output for classification. Slide credit: Yoshua Bengio
Stacked Denoising Autoencoders. No partition function, so the training criterion can be measured directly. Encoder and decoder can use any parametrization. Performs as well as or better than stacking RBMs for unsupervised pre-training (infinite MNIST). Slide credit: Yoshua Bengio
Denoising Autoencoders: Benchmarks [Larochelle et al., 2009]. Slide credit: Yoshua Bengio
Denoising Autoencoders: Results. Test errors on the benchmarks [Larochelle et al., 2009]. Slide credit: Yoshua Bengio
Why Greedy Layer-Wise Training Works (Bengio, 2009; Erhan et al., 2009). Regularization hypothesis: pre-training constrains the parameters to a region relevant to the unsupervised dataset, giving better generalization (representations that better describe unlabeled data are more discriminative for labeled data). Optimization hypothesis: unsupervised training initializes the lower-level parameters near localities of better minima than random initialization can. Slide credit: Yoshua Bengio
Outline: Restricted Boltzmann machines; Deep Belief Networks; Denoising Autoencoders; Applications to Vision; Applications to Audio and Multimodal Data
Convolutional Neural Networks (LeCun et al., 1989). Local receptive fields. Weight sharing. Pooling.
Deep Convolutional Architectures. State-of-the-art on MNIST digits, Caltech-101 objects, etc. Slide credit: Yann LeCun
Learning object representations. Learning objects and parts in images: large image patches contain interesting higher-level structures, e.g., object parts and full objects.
Illustration: learning an eye detector. Filter the example image with an eye-detector filter, then shrink the output (max over 2x2). Advantages of shrinking: 1. The filter size is kept small. 2. Invariance to small shifts.
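The "shrink (max over 2x2)" step is plain non-overlapping max pooling; a small sketch on a made-up 4x4 response map:

```python
import numpy as np

def max_pool_2x2(fmap):
    """Shrink a feature map by taking the max over non-overlapping 2x2
    blocks; small shifts of a strong response within a block leave the
    pooled output unchanged, which is the source of the invariance."""
    H, W = fmap.shape
    blocks = fmap[:H - H % 2, :W - W % 2].reshape(H // 2, 2, W // 2, 2)
    return blocks.max(axis=(1, 3))

fmap = np.array([[1., 2., 0., 0.],
                 [3., 4., 0., 1.],
                 [0., 0., 5., 0.],
                 [0., 2., 0., 6.]])
pooled = max_pool_2x2(fmap)   # -> [[4., 1.], [2., 6.]]
```

The reshape trick splits rows and columns into (block, within-block) pairs so the block maximum is a single vectorized reduction.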
Convolutional RBM (CRBM) [Lee et al., ICML 2009; related work: Norouzi et al., CVPR 2009; Desjardins and Bengio, 2008]. Input (visible layer V); detection layer H of binary hidden nodes, with filter weights W^k shared across positions; max-pooling layer P of binary max-pooling nodes, one per block for each filter k. Constraint: within a pooling block, at most one hidden node is 1 (active). Key properties: convolutional structure; probabilistic max-pooling ("mutual exclusion").
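The mutual-exclusion constraint makes the block posterior a softmax over the detection units plus one extra "all off" state. A sketch for a single pooling block (the bottom-up inputs are made-up numbers):

```python
import numpy as np

def prob_max_pool(energies):
    """Probabilistic max-pooling over one pooling block.

    energies: bottom-up inputs I(h) to the detection units in the block.
    Softmax over the units plus an 'all off' state enforces that at most
    one detection unit fires; the pooling unit is on iff any unit is on."""
    e = np.exp(energies - energies.max())   # stabilized exp(I)
    off = np.exp(-energies.max())           # the 'all units off' state
    z = off + e.sum()
    p_detect = e / z                        # P(h_unit = 1 | v), sums to p_pool
    p_pool = 1.0 - off / z                  # P(pooling unit = 1 | v)
    return p_detect, p_pool

p_detect, p_pool = prob_max_pool(np.array([0.5, 2.0, -1.0, 0.0]))
```

Subtracting the max before exponentiating leaves the probabilities unchanged (numerator and denominator are scaled by the same factor) while avoiding overflow.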
Convolutional deep belief networks illustration Layer 3 W 3 Layer 3 activation (coefficients) Layer 2 W 2 Layer 2 activation (coefficients) Layer 1 W 1 Filter visualization Layer 1 activation (coefficients) Input image Example image 74
Unsupervised learning from natural images. First-layer bases: localized, oriented edges. Second-layer bases: contours, corners, arcs, surface boundaries.
Unsupervised learning of object parts: faces, cars, elephants, chairs.
Image classification with Spatial Pyramids [Lazebnik et al., CVPR 2006; Yang et al., CVPR 2009]. Descriptor layer: detect and locate features, extract corresponding descriptors (e.g., SIFT). Code layer: code the descriptors. Vector quantization (VQ): each code has only one non-zero element. Soft-VQ: a small group of elements can be non-zero. SPM layer: pool codes across subregions and average/normalize into a histogram. Slide credit: Kai Yu
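The difference between hard VQ and soft-VQ coding is easy to show on a toy 2-D codebook (the codebook, descriptor, and softmax-style soft assignment are illustrative choices, not the exact scheme of any cited paper):

```python
import numpy as np

def vq_code(descriptor, codebook):
    """Hard vector quantization: the code has exactly one non-zero
    entry, at the index of the nearest codeword."""
    d = np.linalg.norm(codebook - descriptor, axis=1)
    code = np.zeros(len(codebook))
    code[np.argmin(d)] = 1.0
    return code

def soft_vq_code(descriptor, codebook, beta=1.0):
    """Soft assignment: a small group of nearby codewords get
    non-zero weight (softmax over negative distances)."""
    d = np.linalg.norm(codebook - descriptor, axis=1)
    w = np.exp(-beta * d)
    return w / w.sum()

codebook = np.array([[0., 0.], [1., 0.], [0., 1.]])
hard = vq_code(np.array([0.9, 0.1]), codebook)
soft = soft_vq_code(np.array([0.9, 0.1]), codebook)
```

The SPM layer then averages such codes over spatial subregions at several scales and concatenates the resulting histograms.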
Improving the coding step. Classifiers using these features need nonlinear kernels [Lazebnik et al., CVPR 2006; Grauman and Darrell, JMLR 2007], which have high computational complexity. Idea: modify the coding step to produce feature representations that linear classifiers can use effectively. Sparse coding [Olshausen & Field, Nature 1996; Lee et al., NIPS 2007; Yang et al., CVPR 2009; Boureau et al., CVPR 2010]. Local coordinate coding [Yu et al., NIPS 2009; Wang et al., CVPR 2010]. RBMs [Sohn, Jung, Lee, Hero III, ICCV 2011]. Other feature learning algorithms. Slide credit: Kai Yu
Object Recognition Results. Classification accuracy on Caltech 101 (# of training images: 15 / 30):
Zhang et al., CVPR 2006: 59.1 / 66.2
Griffin et al., 2008: 59.0 / 67.6
ScSPM [Yang et al., CVPR 2009]: 67.0 / 73.2
LLC [Wang et al., CVPR 2010]: 65.4 / 73.4
Macrofeatures [Boureau et al., CVPR 2010]: - / 75.7
Boureau et al., ICCV 2011: - / 77.1
Sparse RBM [Sohn et al., ICCV 2011]: 68.6 / 74.9
Sparse CRBM [Sohn et al., ICCV 2011]: 71.3 / 77.8
Competitive performance with other state-of-the-art methods using a single type of features.
Object Recognition Results. Classification accuracy on Caltech 256 (# of training images: 30 / 60):
Griffin et al., 2008: 34.10 / -
van Gemert et al., PAMI 2010: 27.17 / -
ScSPM [Yang et al., CVPR 2009]: 34.02 / 40.14
LLC [Wang et al., CVPR 2010]: 41.19 / 47.68
Sparse CRBM [Sohn et al., ICCV 2011]: 42.05 / 47.94
Competitive performance with other state-of-the-art methods using a single type of features.
Local Convolutional RBM [Huang, Lee, Learned-Miller, CVPR 2012]. Models convolutional structures in local regions jointly. More statistically effective for learning on nonstationary (roughly aligned) images.
Face Verification [Huang, Lee, Learned-Miller, CVPR 2012]. Feature learning: a convolutional deep belief network with 1 to 2 layers of CRBM/local CRBM, trained on the Kyoto dataset (self-taught learning). Input representation: whitened pixel intensity / LBP; output representation: the top-most pooling layer. Face verification on LFW pairs: metric learning + SVM predicts matched vs. mismatched.
Face Verification [Huang, Lee, Learned-Miller, CVPR 2012] (Method: Accuracy ± SE):
V1-like with MKL (Pinto et al., CVPR 2009): 0.7935 ± 0.0055
Linear rectified units (Nair and Hinton, ICML 2010): 0.8073 ± 0.0134
CSML (Nguyen & Bai, ACCV 2010): 0.8418 ± 0.0048
Learning-based descriptor (Cao et al., CVPR 2010): 0.8445 ± 0.0046
OSS, TSS, full (Wolf et al., ACCV 2009): 0.8683 ± 0.0034
OSS only (Wolf et al., ACCV 2009): 0.8207 ± 0.0041
Combined (LBP + deep learning features): 0.8777 ± 0.0062
Tiled Convolutional models. IDEA: have one subset of filters applied to one set of locations, another subset to another set of locations, etc., and train all parameters jointly. [Gregor and LeCun, arXiv 2010; Ranzato, Mnih, Hinton, NIPS 2010] No block artifacts; reduced redundancy.
Tiled Convolutional models, second stage: treat these units as data to train a similar model on top. A field of binary RBMs; each hidden unit has a receptive field of 30x30 pixels in input space.
Facial Expression Recognition [Ranzato et al., CVPR 2011]. Toronto Face Dataset (J. Susskind et al., 2010): ~100K unlabeled faces from different sources, ~4K labeled images, resolution 48x48 pixels, 7 facial expressions (e.g., neutral, sadness, surprise).
Facial Expression Recognition. Drawing samples from the model (5th layer with 128 hidden units). [Figure: the architecture, from a gMRF 1st layer through stacked RBMs up to the 5th-layer RBM.]
Facial Expression Recognition. Drawing samples from the model (5th layer with 128 hidden units). Ranzato, et al. CVPR 2011
Facial Expression Recognition. 7 synthetic occlusions; use the generative model to fill in the occluded pixels (conditional on the known pixels). [Figure: an ensemble of gMRF + stacked-RBM models.] Ranzato, et al. CVPR 2011
Facial Expression Recognition originals Type 1 occlusion: eyes Restored images 99 Ranzato, et al. CVPR 2011 99
Facial Expression Recognition originals Type 2 occlusion: mouth Restored images Ranzato, et al. CVPR 2011
Facial Expression Recognition originals Type 3 occlusion: right half Restored images Ranzato, et al. CVPR 2011
Facial Expression Recognition originals Type 5 occlusion: top half Restored images Ranzato, et al. CVPR 2011
Facial Expression Recognition originals Type 7 occlusion: 70% of pixels at random Restored images 105 Ranzato, et al. CVPR 2011 105
Facial Expression Recognition: occluded images for both training and test. [Dailey et al., J. Cog. Neurosci., 2003; Wright et al., PAMI 2008] Ranzato, et al. CVPR 2011
Outline: Restricted Boltzmann machines; Deep Belief Networks; Denoising Autoencoders; Applications to Vision; Applications to Audio and Multimodal Data
Motivation: Multi-modal learning. Single learning algorithms that combine multiple input domains: images, audio & speech, video, text, robotic sensors, time-series data, and others.
Motivation: Multi-modal learning. Benefits: more robust performance in multimedia processing, biomedical data mining (fMRI, PET scan, X-ray, EEG, ultrasound), and robot perception (visible-light images, audio, thermal infrared, camera arrays, 3D range scans).
Sparse dictionary learning on audio [Lee, Largman, Pham, Ng, NIPS 2009]. [Figure: a spectrogram approximated as a sparse linear combination of learned bases, e.g., x ≈ 0.9·b_36 + 0.7·b_42 + 0.2·b_63.]
Convolutional DBN for audio [Lee, Largman, Pham, Ng, NIPS 2009]. The spectrogram (frequency x time) is the visible layer; detection nodes and max-pooling nodes are applied convolutionally along time.
Convolutional DBN for audio: a second CDBN layer (detection nodes + max pooling) can be stacked on top of the first CDBN layer (detection nodes + max pooling). [Lee, Largman, Pham, Ng, NIPS 2009]
CDBNs for speech. Trained on the unlabeled TIMIT corpus; learned first-layer bases. [Lee, Largman, Pham, Ng, NIPS 2009]
Comparison of first-layer bases to phonemes. [Figure: learned first-layer bases aligned with the phonemes "oy", "el", and "s".]
Experimental Results. Speaker identification (TIMIT):
Prior art (Reynolds, 1995): 99.7%
Convolutional DBN: 100.0%
Phone classification (TIMIT):
Clarkson et al. (1999): 77.6%
Petrov et al. (2007): 78.6%
Sha & Saul (2006): 78.9%
Yu et al. (2009): 79.2%
Convolutional DBN: 80.3%
Transformation-invariant RBM (Sohn et al., ICML 2012): 81.5%
[Lee, Pham, Largman, Ng, NIPS 2009]
Phone recognition using the mcRBM: mean-covariance RBM + DBN. [Dahl, Ranzato, Mohamed, Hinton, NIPS 2010]
Speech Recognition on TIMIT (Method: PER):
Stochastic Segmental Models: 36.0%
Conditional Random Field: 34.8%
Large-Margin GMM: 33.0%
CD-HMM: 27.3%
Augmented Conditional Random Fields: 26.6%
Recurrent Neural Nets: 26.1%
Bayesian Triphone HMM: 25.6%
Monophone HTMs: 24.8%
Heterogeneous Classifiers: 24.4%
Deep Belief Networks (DBNs): 23.0%
Triphone HMMs discriminatively trained w/ BMMI: 22.7%
Deep Belief Networks with mcRBM feature extraction (Dahl et al., NIPS 2010): 20.5%
Multimodal Feature Learning Lip reading via multimodal feature learning (audio / visual data) Slide credit: Jiquan Ngiam 126
Multimodal Feature Learning. Lip reading via multimodal feature learning (audio / visual data). Q: Is concatenating the modalities the best option? Slide credit: Jiquan Ngiam
Multimodal Feature Learning. Concatenating the modalities and learning features (via a single layer) doesn't work: mostly unimodal features are learned.
Multimodal Feature Learning Bimodal autoencoder Idea: predict unseen modality from observed modality Ngiam et al., ICML 2011 129
Multimodal Feature Learning. Visualization of learned filters: audio (spectrogram) and video features learned over 100 ms windows. Results on the AVLetters lip-reading dataset (Method: Accuracy): prior art (Zhao et al., 2009): 58.9%; multimodal deep autoencoder (Ngiam et al., 2011): 65.8%.
Summary. Learning feature representations: Restricted Boltzmann Machines, Deep Belief Networks, Stacked Denoising Autoencoders. Deep learning algorithms and unsupervised feature learning algorithms show promising results in many applications: vision, audio, multimodal data, and others.
Thank you!