Deep Learning Methods for Vision (draft)


CVPR 2012 Tutorial: Deep Learning Methods for Vision (draft). Honglak Lee, Computer Science and Engineering Division, University of Michigan, Ann Arbor

Feature representations: raw pixels as input. Plotting images in input space (pixel 1 vs. pixel 2), Motorbikes and Non-Motorbikes are hard to separate, and the learning algorithm works directly on pixels.

Feature representations: with a feature representation between the input and the learning algorithm, images are mapped from input space (pixel 1, pixel 2) to a feature space (e.g., handle, wheel), where Motorbikes and Non-Motorbikes become separable.

How is computer perception done? State-of-the-art: hand-crafting. Pipeline: input data → feature representation → learning algorithm. Object detection: image → low-level vision features (SIFT, HOG, etc.) → object detection / classification. Audio classification: audio → low-level audio features (spectrogram, MFCC, etc.) → speaker identification.

Computer vision features: SIFT, Spin image, HoG, RIFT, Textons, GLOH

Computer vision features (SIFT, Spin image, HoG, RIFT, Textons, GLOH). Hand-crafted features: 1. need expert knowledge; 2. require time-consuming hand-tuning; 3. are (arguably) one of the limiting factors of computer vision systems.

Learning Feature Representations. Key idea: learn the statistical structure or correlation of the data from unlabeled data. The learned representations can be used as features in supervised and semi-supervised settings. Known as: unsupervised feature learning, feature learning, deep learning, representation learning, etc. Topics covered in this talk: Restricted Boltzmann Machines, Deep Belief Networks, Denoising Autoencoders, and applications to vision, audio, and multimodal learning.

Learning Feature Hierarchy / Deep Learning. Deep architectures can be representationally efficient; there is a natural progression from low-level to high-level structure; and the lower-level representations can be shared across multiple tasks. Hierarchy: input → 1st layer (edges) → 2nd layer (object parts) → 3rd layer (objects).

Outline: Restricted Boltzmann machines; Deep Belief Networks; Denoising Autoencoders; Applications to Vision; Applications to Audio and Multimodal Data

Learning Feature Hierarchy [Related work: Hinton, Bengio, LeCun, Ng, and others.] Higher layers: DBNs (combinations of edges). First layer: RBMs (edges). Input: image patch (pixels).

Restricted Boltzmann Machines with binary-valued input data. Representation: undirected bipartite graphical model. v ∈ {0,1}^D: observed (visible) binary variables; h ∈ {0,1}^K: hidden binary variables.

Conditional Probabilities (RBM with binary-valued input data). Given v, all the h_j are conditionally independent: P(h_j = 1 | v) = exp(Σ_i W_ij v_i + b_j) / (exp(Σ_i W_ij v_i + b_j) + 1) = sigmoid(Σ_i W_ij v_i + b_j) = sigmoid(w_j^T v + b_j). P(h | v) can be used as features. Given h, all the v_i are conditionally independent: P(v_i = 1 | h) = sigmoid(Σ_j W_ij h_j + c_i).

Restricted Boltzmann Machines with real-valued input data. Representation: undirected bipartite graphical model. V: observed (visible) real variables; H: hidden binary variables.

Conditional Probabilities (RBM with real-valued input data). Given v, all the h_j are conditionally independent: P(h_j = 1 | v) = exp((1/σ) Σ_i W_ij v_i + b_j) / (exp((1/σ) Σ_i W_ij v_i + b_j) + 1) = sigmoid((1/σ) Σ_i W_ij v_i + b_j) = sigmoid((1/σ) w_j^T v + b_j). P(h | v) can be used as features. Given h, all the v_i are conditionally independent: P(v_i | h) = N(σ Σ_j W_ij h_j + c_i, σ²), or P(v | h) = N(σWh + c, σ²I).

Inference. Conditional distribution P(v | h) or P(h | v): easy to compute (see previous slides); due to conditional independence, we can sample all hidden units given all visible units in parallel (and vice versa). Joint distribution P(v, h): requires Gibbs sampling (approximate; many iterations to converge). Initialize with v_0; sample h_0 from P(h | v_0); repeat until convergence (t = 1, …): sample v_t from P(v | h_{t−1}), then sample h_t from P(h | v_t).
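The alternating sampling steps above can be sketched in a few lines of NumPy; the RBM sizes and parameters below are toy values for illustration, not anything from the tutorial.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_h_given_v(v, W, b):
    # P(h_j = 1 | v) = sigmoid(w_j^T v + b_j); all h_j sampled in parallel
    p = sigmoid(v @ W + b)
    return (rng.random(p.shape) < p).astype(float), p

def sample_v_given_h(h, W, c):
    # P(v_i = 1 | h) = sigmoid(sum_j W_ij h_j + c_i); all v_i in parallel
    p = sigmoid(h @ W.T + c)
    return (rng.random(p.shape) < p).astype(float), p

def gibbs_chain(v0, W, b, c, n_steps=100):
    # Alternate block sampling of h and v to draw (approximately) from P(v, h)
    v = v0
    for _ in range(n_steps):
        h, _ = sample_h_given_v(v, W, b)
        v, _ = sample_v_given_h(h, W, c)
    return v

D, K = 6, 4  # visible / hidden sizes (toy example)
W = 0.1 * rng.standard_normal((D, K))
b, c = np.zeros(K), np.zeros(D)
v = gibbs_chain(rng.integers(0, 2, size=D).astype(float), W, b, c)
```

Because the hidden units are conditionally independent given v (and vice versa), each block update is a single matrix product rather than a loop over units.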

Training RBMs. Model: how can we find parameters θ that maximize P_θ(v)? The gradient splits into a data-distribution term (posterior of h given v) and a model-distribution term. We need P(h | v) and P(v, h), and the derivative of E with respect to the parameters {W, b, c}. P(h | v) is tractable; P(v, h) is intractable: it can be approximated with Gibbs sampling, but that requires many iterations.

Contrastive Divergence. An approximation of the log-likelihood gradient for RBMs: 1. replace the average over all possible inputs by samples; 2. run the MCMC chain (Gibbs sampling) for only k steps, starting from the observed example. Initialize with v_0 = v; sample h_0 from P(h | v_0); for t = 1, …, k: sample v_t from P(v | h_{t−1}), then sample h_t from P(h | v_t).

A picture of the maximum likelihood learning algorithm for an RBM: alternate sampling of h given v and v given h for t = 0, 1, 2, …, ∞; at t = ∞ the chain reaches the equilibrium distribution. The weight gradient is ΔW_ij ∝ ⟨v_i h_j⟩^0 − ⟨v_i h_j⟩^∞. Slide credit: Geoff Hinton

A quick way to learn an RBM: start with a training vector on the visible units (t = 0, data); update all the hidden units in parallel; update all the visible units in parallel to get a reconstruction (t = 1); update the hidden units again. ΔW_ij ∝ ⟨v_i h_j⟩^0 − ⟨v_i h_j⟩^1. Slide credit: Geoff Hinton

Update Rule: putting it together. Training via stochastic gradient. Note that ∂E/∂W_ij = −v_i h_j; therefore the weight update is the difference between the data and reconstruction correlations, ΔW_ij ∝ ⟨v_i h_j⟩_data − ⟨v_i h_j⟩_recon. Similar update rules can be derived for the biases b and c. Can be implemented in ~10 lines of Matlab code.
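A Python analogue of the "~10 lines of Matlab" claim above, a minimal CD-1 update for a binary RBM; the batch data, sizes, and learning rate are illustrative toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def cd1_update(V, W, b, c, lr=0.1):
    """One contrastive-divergence (k = 1) step on a batch V of binary rows."""
    ph0 = sigmoid(V @ W + b)                       # P(h | v) on the data
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    pv1 = sigmoid(h0 @ W.T + c)                    # reconstruction
    ph1 = sigmoid(pv1 @ W + b)                     # P(h | v) on the reconstruction
    W += lr * (V.T @ ph0 - pv1.T @ ph1) / len(V)   # <v h>_data - <v h>_recon
    b += lr * (ph0 - ph1).mean(axis=0)
    c += lr * (V - pv1).mean(axis=0)
    return W, b, c, np.mean((V - pv1) ** 2)        # cheap reconstruction error

V = rng.integers(0, 2, size=(20, 8)).astype(float)  # toy binary data
W = 0.01 * rng.standard_normal((8, 4))
b, c = np.zeros(4), np.zeros(8)
for _ in range(100):
    W, b, c, err = cd1_update(V, W, b, c)
```

As the slides note, the reconstruction error returned here is only a cheap proxy for the likelihood, not the quantity CD actually follows.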

Other ways of training RBMs. Persistent CD [Tieleman, ICML 2008; Tieleman & Hinton, ICML 2009]: keep a background MCMC chain to obtain the negative-phase samples; related to stochastic approximation (Robbins and Monro, Ann. Math. Stats., 1951; L. Younes, Probability Theory, 1989). Score matching [Swersky et al., ICML 2011; Hyvarinen, JMLR 2005]: use the score function to eliminate Z; match the model's and the empirical score functions.

Estimating the Log-Likelihood of RBMs. How do we measure the likelihood of the learned models? An RBM requires estimating the partition function Z. Reconstruction error provides a cheap proxy. Log Z is tractable analytically for fewer than ~25 binary inputs or hidden units, and can be approximated with Annealed Importance Sampling (AIS) (Salakhutdinov & Murray, ICML 2008). Open question: efficient ways to monitor progress.

Variants of RBMs

Sparse RBM / DBN [Lee et al., NIPS 2008]. Main idea: constrain the hidden layer nodes to have sparse average values (activations) [cf. sparse coding]. Optimization: trade off between likelihood and a sparsity penalty (log-likelihood plus a penalty comparing each unit's average activation to a target sparsity). Other penalty functions (e.g., KL divergence) can also be used.
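The sparsity penalty above can be sketched concretely; a squared-error penalty between each unit's average activation and the target is assumed here (the slide notes that other penalties, such as KL divergence, work too), and the numbers are toy values.

```python
import numpy as np

def sparsity_penalty(H, target=0.05, lam=1.0):
    """Squared-error sparsity penalty on mean hidden activations.

    H: (n_examples, n_hidden) array of P(h_j = 1 | v) values.
    Returns the penalty value and its gradient w.r.t. the mean activations.
    """
    q = H.mean(axis=0)                  # average activation per hidden unit
    penalty = lam * np.sum((target - q) ** 2)
    grad_q = -2.0 * lam * (target - q)  # pushes each q_j toward the target
    return penalty, grad_q

# Unit 0 fires far too often (mean 0.8); unit 1 is already near the target.
H = np.array([[0.9, 0.05], [0.8, 0.06], [0.7, 0.04]])
penalty, grad = sparsity_penalty(H)
```

During training this gradient is simply added (with a weight λ) to the CD log-likelihood gradient, which is what "trade off between likelihood and sparsity penalty" means in practice.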

Modeling handwritten digits. Sparse dictionary learning via sparse RBMs: weights W connect the input nodes (data) to first-layer bases ("pen strokes") learned from training examples. The learned sparse representations can be used as features. [Lee et al., NIPS 2008; Ranzato et al., NIPS 2007]

3-way factorized RBM [Ranzato et al., AISTATS 2010; Ranzato and Hinton, CVPR 2010]. Models the covariance structure of images using hidden variables: the 3-way factorized RBM / mean-covariance RBM, contrasted with a Gaussian MRF and a 3-way (covariance) RBM. [Slide credit: Marc'Aurelio Ranzato]

Generating natural image patches: natural images vs. samples from the mcRBM (Ranzato and Hinton, CVPR 2010), a GRBM (from Osindero and Hinton, NIPS 2008), and an S-RBM + DBN (from Osindero and Hinton, NIPS 2008). Slide credit: Marc'Aurelio Ranzato

Outline: Restricted Boltzmann machines; Deep Belief Networks; Denoising Autoencoders; Applications to Vision; Applications to Audio and Multimodal Data

Deep Belief Networks (DBNs) (Hinton et al., 2006). A probabilistic generative model with a deep architecture (multiple layers). Unsupervised pre-training provides a good initialization of the network, maximizing a lower bound on the log-likelihood of the data. Supervised fine-tuning: generative (up-down algorithm) or discriminative (backpropagation).

DBN structure: visible layer v and hidden layers h^1, …, h^l; the top two layers form an RBM, and the layers below form a directed belief net. Joint distribution: P(v, h^1, …, h^l) = P(v | h^1) P(h^1 | h^2) ⋯ P(h^{l−2} | h^{l−1}) P(h^{l−1}, h^l). (Hinton et al., 2006)

DBN structure (weights W^1, W^2, W^3): the generative process runs top-down, P(h^{k−1}_i = 1 | h^k) = sigm(Σ_j W^k_ij h^k_j + b_i); approximate inference runs bottom-up with the transposed weights, Q(h^k_j = 1 | h^{k−1}) = sigm(Σ_i W^k_ij h^{k−1}_i + b_j). Joint distribution: P(v, h^1, …, h^l) = P(v | h^1) ⋯ P(h^{l−2} | h^{l−1}) P(h^{l−1}, h^l). (Hinton et al., 2006)

DBN greedy training, first step: construct an RBM with an input layer v and a hidden layer h; train the RBM. (Hinton et al., 2006)

DBN greedy training, second step: stack another hidden layer on top of the RBM to form a new RBM. Fix W^1, sample h^1 from Q(h^1 | v) as input, and train W^2 as an RBM. (Hinton et al., 2006)

DBN greedy training, third step: continue to stack layers on top of the network; train each as in the previous step, with input sampled from Q(h^2 | h^1). And so on. (Hinton et al., 2006)
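The greedy stacking loop above can be sketched as follows; `train_rbm` stands in for any RBM trainer (e.g., CD-1) and is stubbed here with small random weights, so the sizes and initialization are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden):
    # Stand-in for a real RBM trainer (e.g., CD-1); returns weights and biases.
    W = 0.01 * rng.standard_normal((data.shape[1], n_hidden))
    b = np.zeros(n_hidden)
    return W, b

def greedy_train_dbn(v, layer_sizes):
    """Greedy layer-wise training: each new RBM is trained on samples
    from Q(h^k | h^{k-1}) produced by the frozen layer below."""
    layers, data = [], v
    for n_hidden in layer_sizes:
        W, b = train_rbm(data, n_hidden)
        layers.append((W, b))                        # freeze this layer
        q = sigmoid(data @ W + b)                    # Q(h = 1 | input)
        data = (rng.random(q.shape) < q).astype(float)  # input for next RBM
    return layers

v = rng.integers(0, 2, size=(50, 16)).astype(float)  # toy binary data
dbn = greedy_train_dbn(v, [12, 8, 4])
```

The key structural point is that once a layer is appended it is never revisited: only the sampled codes flow upward to train the next RBM.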

Why does greedy training work? An RBM specifies P(v, h) via P(v | h) and P(h | v), and implicitly defines P(v) and P(h). Key idea of stacking: keep P(v | h) from the 1st RBM, and replace P(h) by the distribution generated by the 2nd-level RBM. (Hinton et al., 2006)

Why does greedy training work? A variational lower bound justifies greedy layer-wise training of RBMs: the prior over h^1 (inferred through Q(h^1 | v)) is trained by the second-layer RBM. (Hinton et al., 2006)

Why does greedy training work? Note: an RBM and a 2-layer DBN are equivalent when W^2 = (W^1)^T. Therefore the lower bound is tight at initialization, and the log-likelihood improves as greedy training proceeds. (Hinton et al., 2006)

DBN and supervised fine-tuning. Discriminative fine-tuning: initialize a neural net with the DBN weights and run backpropagation; maximizes log P(Y | X) (X: data, Y: label). Generative fine-tuning: up-down algorithm; maximizes log P(Y, X) (the joint likelihood of data and labels).

A model for digit recognition. Architecture: 28 x 28 pixel image → 500 neurons → 500 neurons → 2000 top-level neurons, with 10 label neurons attached at the top. The top two layers form an associative memory whose energy landscape models the low-dimensional manifolds of the digits; the energy valleys have names (the digit labels). The model learns to generate combinations of labels and images. To perform recognition, we start with a neutral state of the label units and do an up-pass from the image, followed by a few iterations of the top-level associative memory. Slide credit: Geoff Hinton

Fine-tuning with a contrastive version of the wake-sleep algorithm. After learning many layers of features, we can fine-tune the features to improve generation: 1. do a stochastic bottom-up pass and adjust the top-down weights to be good at reconstructing the feature activities in the layer below; 2. do a few iterations of sampling in the top-level RBM and adjust the weights in the top-level RBM; 3. do a stochastic top-down pass and adjust the bottom-up weights to be good at reconstructing the feature activities in the layer above. Slide credit: Geoff Hinton

Generating a sample from a DBN. We want to sample from P(v, h^1, …, h^l) = P(v | h^1) ⋯ P(h^{l−2} | h^{l−1}) P(h^{l−1}, h^l): sample (h^{l−1}, h^l) using Gibbs sampling in the top-level RBM, then sample each lower layer from P(h^{i−1} | h^i), down to v.

Generating samples from a DBN (Hinton et al., "A Fast Learning Algorithm for Deep Belief Nets", 2006).

Results for supervised fine-tuning on MNIST (test error):
- Very carefully trained backprop net with one or two hidden layers (Platt; Hinton): 1.6%
- SVM (Decoste & Schoelkopf, 2002): 1.4%
- Generative model of the joint density of images and labels (+ generative fine-tuning): 1.25%
- Generative model of unlabelled digits followed by gentle backpropagation (Hinton & Salakhutdinov, Science 2006): 1.15%
Slide credit: Geoff Hinton

More details on the up-down algorithm: Hinton, G. E., Osindero, S., and Teh, Y. (2006), "A fast learning algorithm for deep belief nets", Neural Computation, 18. A handwritten digit demo is available online.

Outline: Restricted Boltzmann machines; Deep Belief Networks; Denoising Autoencoders; Applications to Vision; Applications to Audio and Multimodal Data

Denoising Autoencoder [Vincent et al., ICML 2008]. Perturb the input x to a corrupted version (e.g., randomly set some of the coordinates of the input to zero). Recover x from the encoding of the perturbed data: minimize a loss between x and its reconstruction.
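The corrupt-encode-decode loop above can be sketched as a single gradient step of a tiny tied-weight denoising autoencoder with masking noise; the sizes, learning rate, and squared-error loss are illustrative assumptions, not details fixed by the tutorial.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def dae_step(x, W, b, c, corruption=0.3, lr=0.1):
    """One gradient step of a tied-weight denoising autoencoder
    with masking noise and squared-error loss."""
    mask = rng.random(x.shape) >= corruption
    x_tilde = x * mask                # corrupt: zero out random coordinates
    h = sigmoid(x_tilde @ W + b)      # encode the *corrupted* input
    x_hat = sigmoid(h @ W.T + c)      # decode
    # Backprop of L = ||x - x_hat||^2 through the sigmoid units
    d_out = (x_hat - x) * x_hat * (1 - x_hat)
    d_hid = (d_out @ W) * h * (1 - h)
    W -= lr * (x_tilde[:, None] * d_hid[None, :]
               + (h[:, None] * d_out[None, :]).T)
    b -= lr * d_hid
    c -= lr * d_out
    return np.sum((x - x_hat) ** 2)   # loss against the *clean* input

x = rng.integers(0, 2, size=12).astype(float)   # toy binary input
W = 0.1 * rng.standard_normal((12, 6))
b, c = np.zeros(6), np.zeros(12)
losses = [dae_step(x, W, b, c) for _ in range(200)]
```

The point the slide makes is visible in the loss signature: the network sees only x_tilde but is scored against the clean x, which is what forces it to denoise rather than merely copy.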

Denoising Autoencoder [Vincent et al., ICML 2008]. Learns a vector field pointing toward higher-probability regions; minimizes a variational lower bound on a generative model; corresponds to regularized score matching on an RBM. Slide credit: Yoshua Bengio

Stacked Denoising Autoencoders. Greedy layer-wise learning: start with the lowest level and stack upwards; train each layer of the autoencoder on the intermediate code (features) from the layer below. The top layer can have a different output (e.g., softmax nonlinearity) to provide an output for classification. Slide credit: Yoshua Bengio

Stacked Denoising Autoencoders. No partition function, so the training criterion can be measured directly; the encoder and decoder admit any parametrization; performs as well as or better than stacking RBMs for unsupervised pre-training (Infinite MNIST). Slide credit: Yoshua Bengio

Denoising Autoencoders: Benchmarks (Larochelle et al., 2009). Slide credit: Yoshua Bengio

Denoising Autoencoders: Results. Test errors on the benchmarks (Larochelle et al., 2009). Slide credit: Yoshua Bengio

Why Greedy Layer-Wise Training Works (Bengio, 2009; Erhan et al., 2009). Regularization hypothesis: pre-training constrains the parameters to a region relevant to the unsupervised dataset, giving better generalization (representations that better describe unlabeled data are more discriminative for labeled data). Optimization hypothesis: unsupervised training initializes the lower-level parameters near localities of better minima than random initialization can. Slide credit: Yoshua Bengio

Outline: Restricted Boltzmann machines; Deep Belief Networks; Denoising Autoencoders; Applications to Vision; Applications to Audio and Multimodal Data

Convolutional Neural Networks (LeCun et al., 1989): local receptive fields, weight sharing, pooling.

Deep Convolutional Architectures: state-of-the-art on MNIST digits, Caltech-101 objects, etc. Slide credit: Yann LeCun

Learning object representations: learning objects and parts in images. Large image patches contain interesting higher-level structures, e.g., object parts and full objects.

Illustration: learning an eye detector. Apply filters (filter1, …, filter4) to an example image, then shrink the filtering output (max over 2x2) to obtain an eye detector. Advantages of shrinking: 1. the filter size is kept small; 2. invariance.
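The "max over 2x2" shrinking step is ordinary non-overlapping max pooling, which can be sketched in one reshape; the filter-response values below are made up for illustration.

```python
import numpy as np

def max_pool_2x2(F):
    """Shrink a filter-response map by taking the max over
    non-overlapping 2x2 blocks (gives local translation invariance)."""
    H, W = F.shape
    assert H % 2 == 0 and W % 2 == 0
    # Group rows and columns into pairs, then reduce each 2x2 block.
    return F.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

F = np.array([[1, 2, 0, 0],
              [3, 4, 0, 1],
              [0, 0, 9, 2],
              [0, 5, 3, 1]])
P = max_pool_2x2(F)
# P == [[4, 1],
#       [5, 9]]
```

Shifting a strong response by one pixel inside its 2x2 block leaves P unchanged, which is exactly the invariance the slide advertises.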

Convolutional RBM (CRBM) [Lee et al., ICML 2009] [Related work: Norouzi et al., CVPR 2009; Desjardins and Bengio, 2008]. For each filter k, the input (visible layer V) is convolved with shared filter weights W^k to give a detection layer H (binary hidden nodes), which feeds a max-pooling layer P (binary max-pooling nodes). Constraint: within each pooling block, at most one hidden node is 1 (active). Key properties: convolutional structure; probabilistic max-pooling ("mutual exclusion").
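The "at most one active unit per block" constraint makes each pooling block a softmax over its detection units plus an extra all-off state; a sketch for a single block, with made-up bottom-up energies:

```python
import numpy as np

def prob_max_pool_block(energies):
    """Probabilistic max-pooling over one block of detection units.

    energies: bottom-up inputs I(h) to the detection units in the block.
    At most one unit may be on; an extra state (energy 0) represents
    'all units off', i.e., the pooling unit is off. Returns the per-unit
    on-probabilities and P(pooling unit = on).
    """
    e = np.exp(np.append(energies, 0.0))  # last entry: all-off state
    p = e / e.sum()
    return p[:-1], 1.0 - p[-1]

p_units, p_pool = prob_max_pool_block(np.array([1.0, 2.0, 0.5, 0.1]))
# p_units and the all-off probability sum to 1; p_pool = sum(p_units)
```

This is why the pooling is called probabilistic: unlike the hard max on the previous slide, the block assigns a proper distribution over which (if any) detection unit fires, so it still fits inside Gibbs sampling.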

Convolutional deep belief networks, illustration: input image → layer 1 (filters W^1; filter visualization; layer 1 activations/coefficients) → layer 2 (W^2; layer 2 activations) → layer 3 (W^3; layer 3 activations).

Unsupervised learning from natural images: first-layer bases are localized, oriented edges; second-layer bases capture contours, corners, arcs, and surface boundaries.

Unsupervised learning of object parts: faces, cars, elephants, chairs.

Image classification with Spatial Pyramids [Lazebnik et al., CVPR 2006; Yang et al., CVPR 2009]. Descriptor layer: detect and locate features, extract the corresponding descriptors (e.g., SIFT). Code layer: code the descriptors; with vector quantization (VQ) each code has only one non-zero element, while with soft-VQ a small group of elements can be non-zero. SPM layer: pool codes across subregions and average/normalize into a histogram. Slide credit: Kai Yu
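The hard-VQ coding and average-pooling steps above can be sketched directly; the 2-D codebook and random descriptors are toy stand-ins for SIFT descriptors and a learned dictionary.

```python
import numpy as np

def vq_code(descriptors, codebook):
    """Hard vector quantization: each descriptor becomes a one-hot code
    over the codebook (exactly one non-zero element per code)."""
    # Squared distance from every descriptor to every codeword
    d = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    codes = np.zeros((len(descriptors), len(codebook)))
    codes[np.arange(len(descriptors)), d.argmin(axis=1)] = 1.0
    return codes

def pool_histogram(codes):
    """Average-pool codes over a (sub)region into a normalized histogram."""
    h = codes.mean(axis=0)
    return h / h.sum()

rng = np.random.default_rng(0)
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]])  # toy dictionary
descriptors = rng.random((10, 2))                          # toy descriptors
codes = vq_code(descriptors, codebook)
hist = pool_histogram(codes)
```

Soft-VQ would replace the one-hot assignment with weights over a few nearby codewords; the pooling step is unchanged, which is why the coding step is the natural place to improve the pipeline.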

Improving the coding step. Classifiers using these features need nonlinear kernels (Lazebnik et al., CVPR 2006; Grauman and Darrell, JMLR 2007), which have high computational complexity. Idea: modify the coding step to produce feature representations that linear classifiers can use effectively: sparse coding [Olshausen & Field, Nature 1996; Lee et al., NIPS 2007; Yang et al., CVPR 2009; Boureau et al., CVPR 2010]; local coordinate coding [Yu et al., NIPS 2009; Wang et al., CVPR 2010]; RBMs [Sohn, Jung, Lee, Hero III, ICCV 2011]; and other feature learning algorithms. Slide credit: Kai Yu

Object Recognition Results: classification accuracy on Caltech 101, as a function of the number of training images, comparing Zhang et al. (CVPR), Griffin et al., ScSPM [Yang et al., CVPR 2009], LLC [Wang et al., CVPR 2010], Macrofeatures [Boureau et al., CVPR 2010], Boureau et al. (ICCV), sparse RBM [Sohn et al., ICCV 2011], and sparse CRBM [Sohn et al., ICCV 2011]. Competitive performance with other state-of-the-art methods using a single type of feature.

Object Recognition Results: classification accuracy on Caltech 256, as a function of the number of training images, comparing Griffin et al., van Gemert et al. (PAMI), ScSPM [Yang et al., CVPR 2009], LLC [Wang et al., CVPR 2010], and sparse CRBM [Sohn et al., ICCV 2011]. Competitive performance with other state-of-the-art methods using a single type of feature.

Local Convolutional RBM [Huang, Lee, Learned-Miller, CVPR 2012]: models convolutional structures in local regions jointly; statistically more effective for learning from nonstationary (roughly aligned) images.

Face Verification [Huang, Lee, Learned-Miller, CVPR 2012]. Feature learning: a convolutional deep belief network with 1 to 2 layers of CRBM/local CRBM, trained on the Kyoto image set (self-taught learning); input representation: whitened pixel intensity/LBP; output representation: the top-most pooling layer. Verification: metric learning + SVM on LFW face pairs, predicting matched vs. mismatched.

Face Verification [Huang, Lee, Learned-Miller, CVPR 2012]: LFW accuracy (± SE) compared across V1-like with MKL (Pinto et al., CVPR 2009), linear rectified units (Nair and Hinton, ICML 2010), CSML (Nguyen & Bai, ACCV 2010), learning-based descriptor (Cao et al., CVPR 2010), OSS/TSS full and OSS only (Wolf et al., ACCV 2009), and the combined LBP + deep learning features.

Tiled Convolutional models. Idea: apply one subset of filters to one set of locations, another subset to another set of locations, and so on; train all parameters jointly (Gregor & LeCun, arXiv 2010; Ranzato, Mnih, Hinton, NIPS 2010). No block artifacts; reduced redundancy.

Tiled Convolutional models, second stage: treat these units as data and train a similar model on top; a field of binary RBMs, where each hidden unit has a receptive field of 30x30 pixels in input space.

Facial Expression Recognition (Ranzato et al., CVPR 2011). Toronto Face Dataset (J. Susskind et al., 2010): ~100K unlabeled faces from different sources; ~4K labeled images; resolution 48x48 pixels; 7 facial expressions (e.g., neutral, sadness, surprise).

Facial Expression Recognition: drawing samples from the model (5th layer with 128 hidden units); the model is a stack with a gMRF first layer topped with RBM layers up to the 5th-layer RBM.

Facial Expression Recognition: samples drawn from the model (5th layer with 128 hidden units). (Ranzato et al., CVPR 2011)

Facial Expression Recognition with occlusions: 7 types of synthetic occlusion; use the generative model to fill in the missing pixels (conditional on the known pixels). (Ranzato et al., CVPR 2011)

Facial Expression Recognition, restoring occluded inputs (Ranzato et al., CVPR 2011):
- Type 1 occlusion (eyes): originals vs. restored images
- Type 2 occlusion (mouth): originals vs. restored images
- Type 3 occlusion (right half): originals vs. restored images
- Type 5 occlusion (top half): originals vs. restored images
- Type 7 occlusion (70% of pixels at random): originals vs. restored images

Facial Expression Recognition with occluded images for both training and test; compared against Dailey et al. (J. Cog. Neurosci.) and Wright et al. (PAMI). (Ranzato et al., CVPR 2011)

Outline: Restricted Boltzmann machines; Deep Belief Networks; Denoising Autoencoders; Applications to Vision; Applications to Audio and Multimodal Data

Motivation: multi-modal learning. Single learning algorithms that combine multiple input domains: images, audio & speech, video, text, robotic sensors, time-series data, and others.

Motivation: multi-modal learning. Benefits: more robust performance in multimedia processing, biomedical data mining (fMRI, PET scan, X-ray, EEG, ultrasound), and robot perception (visible-light images, audio, thermal infrared, camera arrays, 3D range scans).

Sparse dictionary learning on audio: a spectrogram patch x is approximated as a sparse linear combination of learned bases (e.g., x ≈ 0.9 x one basis plus small contributions from a few others). [Lee, Largman, Pham, Ng, NIPS 2009]

Convolutional DBN for audio: the spectrogram (frequency x time) is the input; detection nodes convolve filters over time, and each max-pooling node pools a group of detection nodes. [Lee, Largman, Pham, Ng, NIPS 2009]

Convolutional DBN for audio, stacked: one CDBN layer (detection nodes + max pooling) followed by a second CDBN layer (detection nodes + max pooling). [Lee, Largman, Pham, Ng, NIPS 2009]

CDBNs for speech: trained on the unlabeled TIMIT corpus; learned first-layer bases. [Lee, Largman, Pham, Ng, NIPS 2009]

Comparison of the learned first-layer bases to phonemes ("oy", "el", "s").

Experimental Results.
Speaker identification (TIMIT accuracy): prior art (Reynolds, 1995): 99.7%; convolutional DBN: 100.0%.
Phone classification (TIMIT accuracy): Clarkson et al. (1999): 77.6%; Petrov et al. (2007): 78.6%; Sha & Saul (2006): 78.9%; Yu et al. (2009): 79.2%; convolutional DBN: 80.3%; transformation-invariant RBM (Sohn et al., ICML 2012): 81.5%.
[Lee, Pham, Largman, Ng, NIPS 2009]

Phone recognition using the mcRBM: mean-covariance RBM + DBN. [Dahl, Ranzato, Mohamed, Hinton, NIPS 2010]

Speech Recognition on TIMIT (phone error rate):
- Stochastic segmental models: 36.0%
- Conditional random field: 34.8%
- Large-margin GMM: 33.0%
- CD-HMM: 27.3%
- Augmented conditional random fields: 26.6%
- Recurrent neural nets: 26.1%
- Bayesian triphone HMM: 25.6%
- Monophone HTMs: 24.8%
- Heterogeneous classifiers: 24.4%
- Deep Belief Networks (DBNs): 23.0%
- Triphone HMMs discriminatively trained w/ BMMI: 22.7%
- Deep Belief Networks with mcRBM feature extraction: 20.5% (Dahl et al., NIPS 2010)

Multimodal Feature Learning: lip reading via multimodal feature learning (audio/visual data). Slide credit: Jiquan Ngiam

Multimodal Feature Learning: lip reading via multimodal feature learning (audio/visual data). Q: Is concatenating the modalities the best option? Slide credit: Jiquan Ngiam

Multimodal Feature Learning: concatenating the modalities and learning features (via a single layer) doesn't work; mostly unimodal features are learned.

Multimodal Feature Learning: bimodal autoencoder. Idea: predict the unseen modality from the observed modality (Ngiam et al., ICML 2011).

Multimodal Feature Learning: visualization of learned filters; audio (spectrogram) and video features learned over 100 ms windows. Results on the AVLetters lip-reading dataset: prior art (Zhao et al., 2009): 58.9%; multimodal deep autoencoder (Ngiam et al., 2011): 65.8%.

Summary. Learning feature representations: Restricted Boltzmann Machines, Deep Belief Networks, Stacked Denoising Autoencoders. Deep learning and unsupervised feature learning algorithms show promising results in many applications: vision, audio, multimodal data, and others.

Thank you!


More information

Learning multiple layers of representation

Learning multiple layers of representation Review TRENDS in Cognitive Sciences Vol.11 No.10 Learning multiple layers of representation Geoffrey E. Hinton Department of Computer Science, University of Toronto, 10 King s College Road, Toronto, M5S

More information

Lecture 6: Classification & Localization. boris. [email protected]

Lecture 6: Classification & Localization. boris. ginzburg@intel.com Lecture 6: Classification & Localization boris. [email protected] 1 Agenda ILSVRC 2014 Overfeat: integrated classification, localization, and detection Classification with Localization Detection. 2 ILSVRC-2014

More information

Introduction to Deep Learning Variational Inference, Mean Field Theory

Introduction to Deep Learning Variational Inference, Mean Field Theory Introduction to Deep Learning Variational Inference, Mean Field Theory 1 Iasonas Kokkinos [email protected] Center for Visual Computing Ecole Centrale Paris Galen Group INRIA-Saclay Lecture 3: recap

More information

A fast learning algorithm for deep belief nets

A fast learning algorithm for deep belief nets A fast learning algorithm for deep belief nets Geoffrey E. Hinton and Simon Osindero Department of Computer Science University of Toronto 10 Kings College Road Toronto, Canada M5S 3G4 {hinton, osindero}@cs.toronto.edu

More information

Deep learning applications and challenges in big data analytics

Deep learning applications and challenges in big data analytics Najafabadi et al. Journal of Big Data (2015) 2:1 DOI 10.1186/s40537-014-0007-7 RESEARCH Open Access Deep learning applications and challenges in big data analytics Maryam M Najafabadi 1, Flavio Villanustre

More information

LEUKEMIA CLASSIFICATION USING DEEP BELIEF NETWORK

LEUKEMIA CLASSIFICATION USING DEEP BELIEF NETWORK Proceedings of the IASTED International Conference Artificial Intelligence and Applications (AIA 2013) February 11-13, 2013 Innsbruck, Austria LEUKEMIA CLASSIFICATION USING DEEP BELIEF NETWORK Wannipa

More information

Image Classification for Dogs and Cats

Image Classification for Dogs and Cats Image Classification for Dogs and Cats Bang Liu, Yan Liu Department of Electrical and Computer Engineering {bang3,yan10}@ualberta.ca Kai Zhou Department of Computing Science [email protected] Abstract

More information

Simple and efficient online algorithms for real world applications

Simple and efficient online algorithms for real world applications Simple and efficient online algorithms for real world applications Università degli Studi di Milano Milano, Italy Talk @ Centro de Visión por Computador Something about me PhD in Robotics at LIRA-Lab,

More information

To Recognize Shapes, First Learn to Generate Images

To Recognize Shapes, First Learn to Generate Images Department of Computer Science 6 King s College Rd, Toronto University of Toronto M5S 3G4, Canada http://learning.cs.toronto.edu fax: +1 416 978 1455 Copyright c Geoffrey Hinton 2006. October 26, 2006

More information

Recognition. Sanja Fidler CSC420: Intro to Image Understanding 1 / 28

Recognition. Sanja Fidler CSC420: Intro to Image Understanding 1 / 28 Recognition Topics that we will try to cover: Indexing for fast retrieval (we still owe this one) History of recognition techniques Object classification Bag-of-words Spatial pyramids Neural Networks Object

More information

called Restricted Boltzmann Machines for Collaborative Filtering

called Restricted Boltzmann Machines for Collaborative Filtering Restricted Boltzmann Machines for Collaborative Filtering Ruslan Salakhutdinov [email protected] Andriy Mnih [email protected] Geoffrey Hinton [email protected] University of Toronto, 6 King

More information

Why Does Unsupervised Pre-training Help Deep Learning?

Why Does Unsupervised Pre-training Help Deep Learning? Journal of Machine Learning Research 11 (2010) 625-660 Submitted 8/09; Published 2/10 Why Does Unsupervised Pre-training Help Deep Learning? Dumitru Erhan Yoshua Bengio Aaron Courville Pierre-Antoine Manzagol

More information

arxiv:1312.6062v2 [cs.lg] 9 Apr 2014

arxiv:1312.6062v2 [cs.lg] 9 Apr 2014 Stopping Criteria in Contrastive Divergence: Alternatives to the Reconstruction Error arxiv:1312.6062v2 [cs.lg] 9 Apr 2014 David Buchaca Prats Departament de Llenguatges i Sistemes Informàtics, Universitat

More information

A Practical Guide to Training Restricted Boltzmann Machines

A Practical Guide to Training Restricted Boltzmann Machines Department of Computer Science 6 King s College Rd, Toronto University of Toronto M5S 3G4, Canada http://learning.cs.toronto.edu fax: +1 416 978 1455 Copyright c Geoffrey Hinton 2010. August 2, 2010 UTML

More information

Efficient Learning of Sparse Representations with an Energy-Based Model

Efficient Learning of Sparse Representations with an Energy-Based Model Efficient Learning of Sparse Representations with an Energy-Based Model Marc Aurelio Ranzato Christopher Poultney Sumit Chopra Yann LeCun Courant Institute of Mathematical Sciences New York University,

More information

Probabilistic Latent Semantic Analysis (plsa)

Probabilistic Latent Semantic Analysis (plsa) Probabilistic Latent Semantic Analysis (plsa) SS 2008 Bayesian Networks Multimedia Computing, Universität Augsburg [email protected] www.multimedia-computing.{de,org} References

More information

Administrivia. Traditional Recognition Approach. Overview. CMPSCI 370: Intro. to Computer Vision Deep learning

Administrivia. Traditional Recognition Approach. Overview. CMPSCI 370: Intro. to Computer Vision Deep learning : Intro. to Computer Vision Deep learning University of Massachusetts, Amherst April 19/21, 2016 Instructor: Subhransu Maji Finals (everyone) Thursday, May 5, 1-3pm, Hasbrouck 113 Final exam Tuesday, May

More information

Machine Learning and AI via Brain simulations

Machine Learning and AI via Brain simulations Machine Learning and AI via Brain simulations Stanford University & Google Thanks to: Stanford: Adam Coates Quoc Le Honglak Lee Andrew Saxe Andrew Maas Chris Manning Jiquan Ngiam Richard Socher Will Zou

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! [email protected]! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct

More information

CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning.

CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning. Lecture Machine Learning Milos Hauskrecht [email protected] 539 Sennott Square, x5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht [email protected] 539 Sennott

More information

Transfer Learning for Latin and Chinese Characters with Deep Neural Networks

Transfer Learning for Latin and Chinese Characters with Deep Neural Networks Transfer Learning for Latin and Chinese Characters with Deep Neural Networks Dan C. Cireşan IDSIA USI-SUPSI Manno, Switzerland, 6928 Email: [email protected] Ueli Meier IDSIA USI-SUPSI Manno, Switzerland, 6928

More information

Deep Learning for Big Data

Deep Learning for Big Data Deep Learning for Big Data Yoshua Bengio Département d Informa0que et Recherche Opéra0onnelle, U. Montréal 30 May 2013, Journée de la recherche École Polytechnique, Montréal Big Data & Data Science Super-

More information

Naive-Deep Face Recognition: Touching the Limit of LFW Benchmark or Not?

Naive-Deep Face Recognition: Touching the Limit of LFW Benchmark or Not? Naive-Deep Face Recognition: Touching the Limit of LFW Benchmark or Not? Erjin Zhou [email protected] Zhimin Cao [email protected] Qi Yin [email protected] Abstract Face recognition performance improves rapidly

More information

Steven C.H. Hoi School of Information Systems Singapore Management University Email: [email protected]

Steven C.H. Hoi School of Information Systems Singapore Management University Email: chhoi@smu.edu.sg Steven C.H. Hoi School of Information Systems Singapore Management University Email: [email protected] Introduction http://stevenhoi.org/ Finance Recommender Systems Cyber Security Machine Learning Visual

More information

Applying Deep Learning to Enhance Momentum Trading Strategies in Stocks

Applying Deep Learning to Enhance Momentum Trading Strategies in Stocks This version: December 12, 2013 Applying Deep Learning to Enhance Momentum Trading Strategies in Stocks Lawrence Takeuchi * Yu-Ying (Albert) Lee [email protected] [email protected] Abstract We

More information

Sense Making in an IOT World: Sensor Data Analysis with Deep Learning

Sense Making in an IOT World: Sensor Data Analysis with Deep Learning Sense Making in an IOT World: Sensor Data Analysis with Deep Learning Natalia Vassilieva, PhD Senior Research Manager GTC 2016 Deep learning proof points as of today Vision Speech Text Other Search & information

More information

Research on the Anti-perspective Correction Algorithm of QR Barcode

Research on the Anti-perspective Correction Algorithm of QR Barcode Researc on te Anti-perspective Correction Algoritm of QR Barcode Jianua Li, Yi-Wen Wang, YiJun Wang,Yi Cen, Guoceng Wang Key Laboratory of Electronic Tin Films and Integrated Devices University of Electronic

More information

Local features and matching. Image classification & object localization

Local features and matching. Image classification & object localization Overview Instance level search Local features and matching Efficient visual recognition Image classification & object localization Category recognition Image classification: assigning a class label to

More information

Supervised Feature Selection & Unsupervised Dimensionality Reduction

Supervised Feature Selection & Unsupervised Dimensionality Reduction Supervised Feature Selection & Unsupervised Dimensionality Reduction Feature Subset Selection Supervised: class labels are given Select a subset of the problem features Why? Redundant features much or

More information

Deep Learning Face Representation by Joint Identification-Verification

Deep Learning Face Representation by Joint Identification-Verification Deep Learning Face Representation by Joint Identification-Verification Yi Sun 1 Yuheng Chen 2 Xiaogang Wang 3,4 Xiaoou Tang 1,4 1 Department of Information Engineering, The Chinese University of Hong Kong

More information

arxiv:1312.6034v2 [cs.cv] 19 Apr 2014

arxiv:1312.6034v2 [cs.cv] 19 Apr 2014 Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps arxiv:1312.6034v2 [cs.cv] 19 Apr 2014 Karen Simonyan Andrea Vedaldi Andrew Zisserman Visual Geometry Group,

More information

Lecture 2: The SVM classifier

Lecture 2: The SVM classifier Lecture 2: The SVM classifier C19 Machine Learning Hilary 2015 A. Zisserman Review of linear classifiers Linear separability Perceptron Support Vector Machine (SVM) classifier Wide margin Cost function

More information

Stochastic Pooling for Regularization of Deep Convolutional Neural Networks

Stochastic Pooling for Regularization of Deep Convolutional Neural Networks Stochastic Pooling for Regularization of Deep Convolutional Neural Networks Matthew D. Zeiler Department of Computer Science Courant Institute, New York University [email protected] Rob Fergus Department

More information

Learning to Process Natural Language in Big Data Environment

Learning to Process Natural Language in Big Data Environment CCF ADL 2015 Nanchang Oct 11, 2015 Learning to Process Natural Language in Big Data Environment Hang Li Noah s Ark Lab Huawei Technologies Part 1: Deep Learning - Present and Future Talk Outline Overview

More information

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches PhD Thesis by Payam Birjandi Director: Prof. Mihai Datcu Problematic

More information

Recognizing Cats and Dogs with Shape and Appearance based Models. Group Member: Chu Wang, Landu Jiang

Recognizing Cats and Dogs with Shape and Appearance based Models. Group Member: Chu Wang, Landu Jiang Recognizing Cats and Dogs with Shape and Appearance based Models Group Member: Chu Wang, Landu Jiang Abstract Recognizing cats and dogs from images is a challenging competition raised by Kaggle platform

More information

DeepFace: Closing the Gap to Human-Level Performance in Face Verification

DeepFace: Closing the Gap to Human-Level Performance in Face Verification DeepFace: Closing the Gap to Human-Level Performance in Face Verification Yaniv Taigman Ming Yang Marc Aurelio Ranzato Facebook AI Research Menlo Park, CA, USA {yaniv, mingyang, ranzato}@fb.com Lior Wolf

More information

Latent variable and deep modeling with Gaussian processes; application to system identification. Andreas Damianou

Latent variable and deep modeling with Gaussian processes; application to system identification. Andreas Damianou Latent variable and deep modeling with Gaussian processes; application to system identification Andreas Damianou Department of Computer Science, University of Sheffield, UK Brown University, 17 Feb. 2016

More information

Group Sparse Coding. Fernando Pereira Google Mountain View, CA [email protected]. Dennis Strelow Google Mountain View, CA strelow@google.

Group Sparse Coding. Fernando Pereira Google Mountain View, CA pereira@google.com. Dennis Strelow Google Mountain View, CA strelow@google. Group Sparse Coding Samy Bengio Google Mountain View, CA [email protected] Fernando Pereira Google Mountain View, CA [email protected] Yoram Singer Google Mountain View, CA [email protected] Dennis Strelow

More information

Module 5. Deep Convnets for Local Recognition Joost van de Weijer 4 April 2016

Module 5. Deep Convnets for Local Recognition Joost van de Weijer 4 April 2016 Module 5 Deep Convnets for Local Recognition Joost van de Weijer 4 April 2016 Previously, end-to-end.. Dog Slide credit: Jose M 2 Previously, end-to-end.. Dog Learned Representation Slide credit: Jose

More information

Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski [email protected]

Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk Introduction to Machine Learning and Data Mining Prof. Dr. Igor Trakovski [email protected] Neural Networks 2 Neural Networks Analogy to biological neural systems, the most robust learning systems

More information

Contractive Auto-Encoders: Explicit Invariance During Feature Extraction

Contractive Auto-Encoders: Explicit Invariance During Feature Extraction : Explicit Invariance During Feature Extraction Salah Rifai (1) Pascal Vincent (1) Xavier Muller (1) Xavier Glorot (1) Yoshua Bengio (1) (1) Dept. IRO, Université de Montréal. Montréal (QC), H3C 3J7, Canada

More information

NAVIGATING SCIENTIFIC LITERATURE A HOLISTIC PERSPECTIVE. Venu Govindaraju

NAVIGATING SCIENTIFIC LITERATURE A HOLISTIC PERSPECTIVE. Venu Govindaraju NAVIGATING SCIENTIFIC LITERATURE A HOLISTIC PERSPECTIVE Venu Govindaraju BIOMETRICS DOCUMENT ANALYSIS PATTERN RECOGNITION 8/24/2015 ICDAR- 2015 2 Towards a Globally Optimal Approach for Learning Deep Unsupervised

More information

Unsupervised and Transfer Learning Challenge: a Deep Learning Approach

Unsupervised and Transfer Learning Challenge: a Deep Learning Approach JMLR: Workshop and Conference Proceedings 27:97 111, 2012 Workshop on Unsupervised and Transfer Learning Unsupervised and Transfer Learning Challenge: a Deep Learning Approach Grégoire Mesnil 1,2 [email protected]

More information

MVA ENS Cachan. Lecture 2: Logistic regression & intro to MIL Iasonas Kokkinos [email protected]

MVA ENS Cachan. Lecture 2: Logistic regression & intro to MIL Iasonas Kokkinos Iasonas.kokkinos@ecp.fr Machine Learning for Computer Vision 1 MVA ENS Cachan Lecture 2: Logistic regression & intro to MIL Iasonas Kokkinos [email protected] Department of Applied Mathematics Ecole Centrale Paris Galen

More information

Big Data Deep Learning: Challenges and Perspectives

Big Data Deep Learning: Challenges and Perspectives Big Data Deep Learning: Challenges and Perspectives D.saraswathy Department of computer science and engineering IFET college of engineering Villupuram [email protected] Abstract Deep

More information

Simplified Machine Learning for CUDA. Umar Arshad @arshad_umar Arrayfire @arrayfire

Simplified Machine Learning for CUDA. Umar Arshad @arshad_umar Arrayfire @arrayfire Simplified Machine Learning for CUDA Umar Arshad @arshad_umar Arrayfire @arrayfire ArrayFire CUDA and OpenCL experts since 2007 Headquartered in Atlanta, GA In search for the best and the brightest Expert

More information

Learning Invariant Features through Topographic Filter Maps

Learning Invariant Features through Topographic Filter Maps Learning Invariant Features through Topographic Filter Maps Koray Kavukcuoglu Marc Aurelio Ranzato Rob Fergus Yann LeCun Courant Institute of Mathematical Sciences New York University {koray,ranzato,fergus,yann}@cs.nyu.edu

More information

Automatic 3D Reconstruction via Object Detection and 3D Transformable Model Matching CS 269 Class Project Report

Automatic 3D Reconstruction via Object Detection and 3D Transformable Model Matching CS 269 Class Project Report Automatic 3D Reconstruction via Object Detection and 3D Transformable Model Matching CS 69 Class Project Report Junhua Mao and Lunbo Xu University of California, Los Angeles [email protected] and lunbo

More information

A Simple Introduction to Support Vector Machines

A Simple Introduction to Support Vector Machines A Simple Introduction to Support Vector Machines Martin Law Lecture for CSE 802 Department of Computer Science and Engineering Michigan State University Outline A brief history of SVM Large-margin linear

More information

Machine Learning in Computer Vision A Tutorial. Ajay Joshi, Anoop Cherian and Ravishankar Shivalingam Dept. of Computer Science, UMN

Machine Learning in Computer Vision A Tutorial. Ajay Joshi, Anoop Cherian and Ravishankar Shivalingam Dept. of Computer Science, UMN Machine Learning in Computer Vision A Tutorial Ajay Joshi, Anoop Cherian and Ravishankar Shivalingam Dept. of Computer Science, UMN Outline Introduction Supervised Learning Unsupervised Learning Semi-Supervised

More information

Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning

Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning SAMSI 10 May 2013 Outline Introduction to NMF Applications Motivations NMF as a middle step

More information

Learning Deep Face Representation

Learning Deep Face Representation Learning Deep Face Representation Haoqiang Fan Megvii Inc. [email protected] Zhimin Cao Megvii Inc. [email protected] Yuning Jiang Megvii Inc. [email protected] Qi Yin Megvii Inc. [email protected] Chinchilla Doudou

More information

Deep Learning for Multivariate Financial Time Series. Gilberto Batres-Estrada

Deep Learning for Multivariate Financial Time Series. Gilberto Batres-Estrada Deep Learning for Multivariate Financial Time Series Gilberto Batres-Estrada June 4, 2015 Abstract Deep learning is a framework for training and modelling neural networks which recently have surpassed

More information

An Introduction to Deep Learning

An Introduction to Deep Learning Thought Leadership Paper Predictive Analytics An Introduction to Deep Learning Examining the Advantages of Hierarchical Learning Table of Contents 4 The Emergence of Deep Learning 7 Applying Deep-Learning

More information

Evaluating probabilities under high-dimensional latent variable models

Evaluating probabilities under high-dimensional latent variable models Evaluating probabilities under ig-dimensional latent variable models Iain Murray and Ruslan alakutdinov Department of Computer cience University of oronto oronto, ON. M5 3G4. Canada. {murray,rsalaku}@cs.toronto.edu

More information

Optical Flow. Shenlong Wang CSC2541 Course Presentation Feb 2, 2016

Optical Flow. Shenlong Wang CSC2541 Course Presentation Feb 2, 2016 Optical Flow Shenlong Wang CSC2541 Course Presentation Feb 2, 2016 Outline Introduction Variation Models Feature Matching Methods End-to-end Learning based Methods Discussion Optical Flow Goal: Pixel motion

More information

Recurrent Neural Networks

Recurrent Neural Networks Recurrent Neural Networks Neural Computation : Lecture 12 John A. Bullinaria, 2015 1. Recurrent Neural Network Architectures 2. State Space Models and Dynamical Systems 3. Backpropagation Through Time

More information

DeepFace: Closing the Gap to Human-Level Performance in Face Verification

DeepFace: Closing the Gap to Human-Level Performance in Face Verification DeepFace: Closing the Gap to Human-Level Performance in Face Verification Yaniv Taigman Ming Yang Marc Aurelio Ranzato Facebook AI Group Menlo Park, CA, USA {yaniv, mingyang, ranzato}@fb.com Lior Wolf

More information

Semi-Supervised Support Vector Machines and Application to Spam Filtering

Semi-Supervised Support Vector Machines and Application to Spam Filtering Semi-Supervised Support Vector Machines and Application to Spam Filtering Alexander Zien Empirical Inference Department, Bernhard Schölkopf Max Planck Institute for Biological Cybernetics ECML 2006 Discovery

More information

Learning Mid-Level Features For Recognition

Learning Mid-Level Features For Recognition Learning Mid-Level Features For Recognition Y-Lan Boureau 1,3,4 Francis Bach 1,4 Yann LeCun 3 Jean Ponce 2,4 1 INRIA 2 Ecole Normale Supérieure 3 Courant Institute, New York University Abstract Many successful

More information

Deep Learning of Invariant Features via Simulated Fixations in Video

Deep Learning of Invariant Features via Simulated Fixations in Video Deep Learning of Invariant Features via Simulated Fixations in Video Will Y. Zou 1, Shenghuo Zhu 3, Andrew Y. Ng 2, Kai Yu 3 1 Department of Electrical Engineering, Stanford University, CA 2 Department

More information

Research Article Distributed Data Mining Based on Deep Neural Network for Wireless Sensor Network

Research Article Distributed Data Mining Based on Deep Neural Network for Wireless Sensor Network Distributed Sensor Networks Volume 2015, Article ID 157453, 7 pages http://dx.doi.org/10.1155/2015/157453 Research Article Distributed Data Mining Based on Deep Neural Network for Wireless Sensor Network

More information

The Role of Size Normalization on the Recognition Rate of Handwritten Numerals

The Role of Size Normalization on the Recognition Rate of Handwritten Numerals The Role of Size Normalization on the Recognition Rate of Handwritten Numerals Chun Lei He, Ping Zhang, Jianxiong Dong, Ching Y. Suen, Tien D. Bui Centre for Pattern Recognition and Machine Intelligence,

More information

degrees of freedom and are able to adapt to the task they are supposed to do [Gupta].

degrees of freedom and are able to adapt to the task they are supposed to do [Gupta]. 1.3 Neural Networks 19 Neural Networks are large structured systems of equations. These systems have many degrees of freedom and are able to adapt to the task they are supposed to do [Gupta]. Two very

More information

Introduction to Machine Learning Lecture 1. Mehryar Mohri Courant Institute and Google Research [email protected]

Introduction to Machine Learning Lecture 1. Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Introduction to Machine Learning Lecture 1 Mehryar Mohri Courant Institute and Google Research [email protected] Introduction Logistics Prerequisites: basics concepts needed in probability and statistics

More information

Università degli Studi di Bologna

Università degli Studi di Bologna Università degli Studi di Bologna DEIS Biometric System Laboratory Incremental Learning by Message Passing in Hierarchical Temporal Memory Davide Maltoni Biometric System Laboratory DEIS - University of

More information

Programming Exercise 3: Multi-class Classification and Neural Networks

Programming Exercise 3: Multi-class Classification and Neural Networks Programming Exercise 3: Multi-class Classification and Neural Networks Machine Learning November 4, 2011 Introduction In this exercise, you will implement one-vs-all logistic regression and neural networks

More information