Introduction to Deep Learning and its applications in Computer Vision
1 Introduction to Deep Learning and its applications in Computer Vision Xiaogang Wang Department of Electronic Engineering, The Chinese University of Hong Kong Morning Tutorial (9:00–12:30), March 23
2 Outline Introduction to deep learning Deep learning for object recognition Deep learning for object segmentation Deep learning for object detection Open questions and future works 2
3 Part I: Introduction to Deep Learning Historical review of deep learning Introduction to classical deep models Why does deep learning work? 3
4 Machine Learning: x → F(x) → y, mapping an input vector to a class label (classification) or an estimate (estimation). Examples: object recognition (image → {dog, cat, horse, flower, ...}); super resolution (low-resolution image → high-resolution image) 4
5 Neural network Back propagation Nature 1986 Solve general learning problems Tied with biological system 5
6 Neural network Back propagation Nature 1986 A single neuron computes g(x) = f(net), where net = w1 x1 + w2 x2 + w3 x3 is the weighted sum of the inputs and f is a nonlinear activation function 6
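To make the single-neuron computation above concrete, here is a minimal NumPy sketch (not from the tutorial; the weights, bias, and inputs are made-up values, and f is taken to be a sigmoid):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A single neuron: g(x) = f(net), with net the weighted sum of the inputs.
w = np.array([0.5, -1.2, 0.3])   # hypothetical weights w1, w2, w3
b = 0.1                          # hypothetical bias term
x = np.array([1.0, 0.5, -2.0])   # inputs x1, x2, x3

net = np.dot(w, x) + b           # weighted sum
g_x = sigmoid(net)               # neuron output g(x) = f(net)
print(g_x)
```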
7 Neural network Back propagation Nature 1986 Solve general learning problems Tied with biological system But it was given up: hard to train, insufficient computational resources, small training sets, it did not work well 7
8 Neural network Back propagation Nature SVM Boosting Decision tree KNN Flat structures Loose tie with biological systems Specific methods for specific tasks Hand crafted features (GMM-HMM, SIFT, LBP, HOG) Kruger et al. TPAMI 13 8
9 Neural network Back propagation Deep belief net Science Nature Unsupervised & layer-wise pre-training Better designs for modeling and training (normalization, nonlinearity, dropout) New development of computer architectures GPU Multi-core computer systems Large scale databases Big Data! 9
10 Machine Learning with Big Data Machine learning with small data: overfitting, reducing model complexity (capacity) Machine learning with big data: underfitting, increasing model complexity, optimization, computation resource 10
11 How to increase model capacity? Curse of dimensionality Blessing of dimensionality Learning hierarchical feature transforms (learning features with deep structures) D. Chen, X. Cao, F. Wen, and J. Sun. Blessing of dimensionality: High-dimensional feature and its efficient compression for face verification. In Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition,
12 Neural network Back propagation Nature Deep belief net Science Speech deep learning results Solve general learning problems Tied with biological system But it is given up 12
13 Neural network Back propagation Nature Deep belief net Science Speech ImageNet 2012 image classification challenge: Rank 1 U. Toronto (deep learning); Rank 2 U. Tokyo, Rank 3 U. Oxford, Rank 4 Xerox/INRIA (hand-crafted features and learning models; bottleneck). Object recognition over 1,000,000 images and 1,000 categories (2 GPUs). A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, NIPS, 2012. 13
14 Examples from ImageNet 14
15 Neural network Back propagation Deep belief net Science Speech ImageNet 2013 image classification challenge: Rank 1 NYU (deep learning), Rank 2 NUS (deep learning), Rank 3 Oxford (deep learning); MSRA, IBM, Adobe, NEC, Clarifai, Berkeley, U. Tokyo, UCLA, UIUC, Toronto: the top 20 groups all used deep learning. ImageNet 2013 object detection challenge: Rank 1 UvA-Euvision (hand-crafted features), Rank 2 NEC-MU (hand-crafted features), Rank 3 NYU (deep learning) 15
16 Neural network Back propagation Deep belief net Science Speech ImageNet 2014 image classification challenge: Rank 1 Google, Rank 2 Oxford, Rank 3 MSRA, all deep learning. ImageNet 2014 object detection challenge (mean average precision): Rank 1 Google, Rank 2 CUHK, Rank 3 DeepInsight, Rank 4 UvA-Euvision, Rank 5 Berkeley Vision, all deep learning 16
17 Neural network Back propagation Deep belief net Science Speech ImageNet 2014 object detection challenge (mean average precision, model average and single model): GoogLeNet (Google), DeepID-Net (CUHK), DeepInsight, UvA-Euvision, Berkeley Vision, RCNN; entries without results are marked n/a. W. Ouyang et al. DeepID-Net: multi-stage and deformable deep convolutional neural networks for object detection, arxiv: ,
18 Neural network Back propagation Deep belief net Science Speech Google and Baidu announced their deep learning based visual search engines (2013) Google: "on our test set we saw double the average precision when compared to other approaches we had tried. We acquired the rights to the technology and went full speed ahead adapting it to run at large scale on Google's computers. We took cutting edge research straight out of an academic research lab and launched it, in just a little over six months." Baidu 18
19 Neural network Back propagation Deep belief net Science Speech Face recognition Deep learning achieves 99.47% face verification accuracy on Labeled Faces in the Wild (LFW), higher than human performance Y. Sun, X. Wang, and X. Tang. Deep Learning Face Representation by Joint Identification-Verification. NIPS, Y. Sun, X. Wang, and X. Tang. Deeply learned face representations are sparse, selective, and robust. arxiv: ,
20 Labeled Faces in the Wild (2007) Best results without deep learning 20
21 21
22 22
23 Design Cycle start Domain knowledge Collect data Preprocessing Feature design Interest of people working on computer vision, speech recognition, medical image processing, Preprocessing and feature design may lose useful information and not be optimized, since they are not parts of an end-to-end learning system Preprocessing could be the result of another pattern recognition system Choose and design model Train classifier Evaluation end Interest of people working on machine learning Interest of people working on machine learning and computer vision, speech recognition, medical image processing, 23
24 Person re-identification pipeline Pedestrian detection Pose estimation Body parts segmentation Photometric & geometric transform Feature extraction Classification Face recognition pipeline Face alignment Geometric rectification Photometric rectification Feature extraction Classification 24
25 Design Cycle with Deep Learning start Collect data Learning plays a bigger role in the design cycle Feature learning becomes part of the end-to-end learning system Preprocessing becomes optional, which means that several pattern recognition steps can be merged into one end-to-end learning system Feature learning makes the key difference Preprocessing (Optional) Design network Feature learning Classifier Train network Evaluation We underestimated the importance of data collection and evaluation end 25
26 What makes deep learning successful in computer vision? Li Fei-Fei Geoffrey Hinton Data collection Evaluation task Deep learning One million images with labels Predict 1,000 image categories CNN is not new Design network structure New training strategies Feature learned from ImageNet can be well generalized to other tasks and datasets! 26
27 Learning features and classifiers separately Not all the datasets and prediction tasks are suitable for learning features with deep models Training stage A Dataset A Dataset B Training stage B Deep learning feature transform feature transform Classifier 1 Classifier 2... Classifier B Prediction on task 1 Prediction on task 2... Prediction on task B (Our target task) 27
28 Deep learning can be treated as a language to describe the world with great flexibility Collect data Collect data Preprocessing 1 Deep neural network Preprocessing 2 Feature design Classifier Connection Feature transform Feature transform Classifier Evaluation Evaluation 28
29 Introduction to Deep Learning Historical review of deep learning Introduction to classical deep models Why does deep learning work? 29
30 Introduction on Classical Deep Models Convolutional Neural Networks (CNN) Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based Learning Applied to Document Recognition, Proceedings of the IEEE, Vol. 86, pp , Deep Belief Net (DBN) G. E. Hinton, S. Osindero, and Y. Teh, A Fast Learning Algorithm for Deep Belief Nets, Neural Computation, Vol. 18, pp , Auto-encoder G. E. Hinton and R. R. Salakhutdinov, Reducing the Dimensionality of Data with Neural Networks, Science, Vol. 313, pp , July
31 Classical Deep Models Convolutional Neural Networks (CNN) First proposed by Fukushima in 1980 Improved by LeCun, Bottou, Bengio and Haffner in 1998 Convolution Pooling Learned filters 31
32 Backpropagation W denotes the parameters of the network; J is the objective function. In the feedforward operation, the input layer is passed through the hidden layers to the output layer, whose outputs are compared with the target values; the errors are then back-propagated from the output layer towards the input layer to update W. D. E. Rumelhart, G. E. Hinton, R. J. Williams, Learning Representations by Back-propagation Errors, Nature, Vol. 323, pp , 1986. 32
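A minimal NumPy sketch of the feedforward operation and the back error propagation just described, for a single hidden layer and a squared-error objective J(W); the layer sizes, data, and learning rate are toy values, not from the tutorial:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)            # input layer
t = np.array([1.0, 0.0])          # target values

W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # hidden layer parameters
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)   # output layer parameters
lr = 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for step in range(100):
    # Feedforward operation
    h = sigmoid(W1 @ x + b1)
    y = W2 @ h + b2
    J = 0.5 * np.sum((y - t) ** 2)              # objective J (monitor to check convergence)

    # Back error propagation (chain rule from output layer to input layer)
    delta_out = y - t                           # dJ/dy
    dW2, db2 = np.outer(delta_out, h), delta_out
    delta_h = (W2.T @ delta_out) * h * (1 - h)  # error at the hidden layer
    dW1, db1 = np.outer(delta_h, x), delta_h

    # Gradient descent on the network parameters W
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1
```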
33 Classical Deep Models Deep belief net Hinton 06 Pre-training: Good initialization point Make use of unlabeled data P(x, h1, h2) = P(x | h1) P(h1, h2), where each pair of adjacent layers is a restricted Boltzmann machine: P(x, h1) = e^{E(x, h1)} / Σ_{x, h1} e^{E(x, h1)}, with energy function E(x, h1) = b'x + c'h1 + h1'Wx 33
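To illustrate the layer-wise pre-training idea, the sketch below runs a generic contrastive-divergence (CD-1) update for a single RBM layer on toy binary data; this is a textbook-style version under assumed sizes and learning rate, not the exact procedure used for the models above:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_visible, n_hidden, lr = 6, 4, 0.05
W = 0.01 * rng.normal(size=(n_hidden, n_visible))  # weights W
b = np.zeros(n_visible)                            # visible bias b
c = np.zeros(n_hidden)                             # hidden bias c

X = (rng.random((100, n_visible)) > 0.5).astype(float)  # toy binary (unlabeled) data

for epoch in range(10):
    for x in X:
        # Positive phase: sample hidden units from P(h1 | x)
        p_h = sigmoid(W @ x + c)
        h = (rng.random(n_hidden) < p_h).astype(float)
        # Negative phase: reconstruct x and re-infer hidden probabilities
        p_x = sigmoid(W.T @ h + b)
        x_neg = (rng.random(n_visible) < p_x).astype(float)
        p_h_neg = sigmoid(W @ x_neg + c)
        # CD-1 update: difference of data and reconstruction statistics
        W += lr * (np.outer(p_h, x) - np.outer(p_h_neg, x_neg))
        b += lr * (x - x_neg)
        c += lr * (p_h - p_h_neg)
```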
34 Classical Deep Models Auto-encoder Hinton and Salakhutdinov 2006 Encoding: h1 = σ(W1 x + b1), h2 = σ(W2 h1 + b2) Decoding: h~1 = σ(W2' h2 + b3), x~ = σ(W1' h~1 + b4), where the decoder reuses the transposed encoder weights W2', W1' 34
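A small NumPy sketch of the encoder/decoder equations above, with the decoder reusing the transposed encoder weights; the dimensions and random data are toy values:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d, d1, d2 = 8, 5, 3                       # input and hidden dimensions (toy)
W1, b1 = rng.normal(size=(d1, d)), np.zeros(d1)
W2, b2 = rng.normal(size=(d2, d1)), np.zeros(d2)
b3, b4 = np.zeros(d1), np.zeros(d)        # decoder biases

x = rng.random(d)

# Encoding: h1 = sigma(W1 x + b1), h2 = sigma(W2 h1 + b2)
h1 = sigmoid(W1 @ x + b1)
h2 = sigmoid(W2 @ h1 + b2)
# Decoding with transposed weights: h~1 = sigma(W2' h2 + b3), x~ = sigma(W1' h~1 + b4)
h1_tilde = sigmoid(W2.T @ h2 + b3)
x_tilde = sigmoid(W1.T @ h1_tilde + b4)

reconstruction_error = np.sum((x - x_tilde) ** 2)
```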
35 Introduction to Deep Learning Historical review of deep learning Introduction to classical deep models Why does deep learning work? 35
36 Feature Learning vs Feature Engineering 36
37 Feature Engineering The performance of a pattern recognition system heavily depends on feature representations Manually designed features have dominated the applications of image and video understanding in the past They rely on human domain knowledge much more than on data Feature design is separate from training the classifier If handcrafted features have multiple parameters, it is hard to manually tune them Developing effective features for new applications is slow 37
38 Handcrafted Features for Face Recognition 2 parameters 3 parameters Geometric features Pixel vector Gabor filters Local binary patterns 1980s
39 Feature Learning Learning transformations of the data that make it easier to extract useful information when building classifiers or predictors Jointly learning feature transformations and classifiers makes their integration optimal Learn the values of a huge number of parameters in feature representations Faster to get feature representations for new applications Make better use of big data 39
40 Deep Learning Means Feature Learning Deep learning is about learning hierarchical feature representations: Data → Trainable Feature Transform → Trainable Feature Transform → Trainable Feature Transform → Trainable Feature Transform → Classifier Good feature representations should be able to disentangle multiple factors coupled in the data, e.g. an ideal feature transform maps pixel space to separate factors such as view and expression 40
41 Deep Learning Means Feature Learning How to effectively learn features with deep models With challenging tasks Predict high-dimensional vectors Pre-train on classifying 1,000 categories Fine-tune on classifying 201 categories Detect 200 object classes on ImageNet Feature representation SVM binary classifier for each category W. Ouyang et al. DeepID-Net: multi-stage and deformable deep convolutional neural 41 networks for object detection, arxiv: , 2014
42 Training stage A Training stage B Training stage C Dataset A Dataset B Dataset C feature transform feature transform feature transform Fixed Classifier A Classifier B SVM Distinguish 1000 categories Distinguish 201 categories Distinguish one object class from all the negatives 42
43 Example 1: deep learning generic image features Hinton group's groundbreaking work on ImageNet They did not have much experience on general image classification on ImageNet It took one week to train the network with 60 million parameters The learned feature representations are effective on other datasets (e.g. Pascal VOC) and other tasks (object detection, segmentation, tracking, and image retrieval) 43
44 96 learned low-level filters 44
45 Image classification result 45
46 Top hidden layer can be used as feature for retrieval 46
47 Example 2: deep learning face identity features by recovering canonical-view face images Reconstruction examples from LFW Z. Zhu, P. Luo, X. Wang, and X. Tang, Deep Learning Identity Preserving Face Space, ICCV
48 Deep model can disentangle hidden factors through feature extraction over multiple layers No 3D model; no prior information on pose and lighting condition Model multiple complex transforms Reconstructing the whole face is a much stronger supervision than predicting a 0/1 class label and helps to avoid overfitting Arbitrary view → Canonical view 48
49 49
50 Comparison on Multi-PIE across poses -45°, -30°, -15°, +15°, +30°, +45° and on average: LGBP [26], VAAM [17], FA-EGFC [3], SA-EGFC [3], LE [4] + LDA, CRBM [9] + LDA, and ours 50
51 Deep learning 3D model from 2D images, mimicking human brain activities Z. Zhu, P. Luo, X. Wang, and X. Tang, Deep Learning and Disentangling Face Representation by Multi-View 51 Perception, NIPS 2014.
52 Training stage A Training stage B Face images in arbitrary views Two face images in arbitrary views Deep learning Face identity features feature transform Fixed Regressor 1 Regressor 2... Linear Discriminant analysis Reconstruct view 1 Reconstruct view 2... The two images belonging to the same person or not Face reconstruction Face verification 52
53 Example 3: deep learning face identity features from predicting 10,000 classes At the training stage, each input image is classified into 10,000 identities with 160 hidden identity features in the top layer The hidden identity features can be well generalized to other tasks (e.g. verification) and identities outside the training set As the number of classes to be predicted increases, the generalization power of the learned features also improves Y. Sun, X. Wang, and X. Tang. Deep Learning Face Representation by Joint Identification-Verification. NIPS,
54 Training stage A Training stage B Dataset A Dataset B feature transform feature transform Fixed Classifier A Linear classifier B Distinguish 10,000 people The two images belonging to the same person or not Face identification Face verification 54
55 Deep Structures vs Shallow Structures (Why deep?) 55
56 Shallow Structures A three-layer neural network (with one hidden layer) can approximate any classification function Most machine learning tools (such as SVM, boosting, and KNN) can be approximated as neural networks with one or two hidden layers Shallow models divide the feature space into regions and match templates in local regions. O(N) parameters are needed to represent N regions SVM 56
57 Deep Machines are More Efficient for Representing Certain Classes of Functions Theoretical results show that an architecture with insufficient depth can require many more computational elements, potentially exponentially more (with respect to input size), than architectures whose depth is matched to the task (Hastad 1986, Hastad and Goldmann 1991) It also means many more parameters to learn 57
58 Take the d-bit parity function as an example: parity(X1, ..., Xd) = 1 if the sum of the Xi is even d-bit logical parity circuits of depth 2 have exponential size (Andrew Yao, 1985) There are functions computable with a polynomial-size logic gate circuit of depth k that require exponential size when restricted to depth k-1 (Hastad, 1986) 58
59 Architectures with multiple levels naturally provide sharing and re-use of components 59 Honglak Lee, NIPS 10
60 Humans Understand the World through Multiple Levels of Abstractions We do not interpret a scene image with pixels Objects (sky, cars, roads, buildings, pedestrians) -> parts (wheels, doors, heads) -> texture -> edges -> pixels Attributes: blue sky, red car It is natural for humans to decompose a complex problem into sub-problems through multiple levels of representations 60
61 Humans Understand the World through Multiple Levels of Abstractions Humans learn abstract concepts on top of less abstract ones Humans can imagine new pictures by re-configuring these abstractions at multiple levels. Thus our brain has good generalization and can recognize things never seen before. Our brain can estimate shape, lighting and pose from a face image and generate new images under various lightings and poses. That's why we have good face recognition capability. 61
62 Local and Global Representations 62
63 Human Brains Process Visual Signals through Multiple Layers A visual cortical area consists of six layers (Kruger et al. 2013) 63
64 Joint Learning vs Separate Learning Separate learning: data collection → preprocessing step 1 → preprocessing step 2 → feature extraction → classification, with each step manually designed or trained in isolation. End-to-end learning: data collection → feature transform → feature transform → feature transform → classification, trained jointly. Deep learning is a framework/language but not a black-box model Its power comes from joint optimization and increasing the capacity of the learner 64
65 Domain knowledge could be helpful for designing new deep models and training strategies How to formulate a vision problem with deep learning? Make use of experience and insights obtained in CV research Sequential design/learning vs joint learning Effectively train a deep model (layerwise pre-training + fine tuning) Conventional object recognition scheme: feature extraction → quantization (visual words) → spatial pyramid (histograms in local regions) → classification; feature extraction corresponds to filtering, quantization to filtering, and the spatial pyramid to multi-level pooling in a CNN built from stacked filtering & max pooling stages (Krizhevsky NIPS 12) 65
66 What if we treat an existing deep model as a black box in pedestrian detection? ConvNet-U-MS Sermanet, K. Kavukcuoglu, S. Chintala, and LeCun, Pedestrian Detection with Unsupervised Multi-Stage Feature Learning, CVPR
67 Results on Caltech Test Results on ETHZ 67
68 N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. CVPR, (6000 citations) P. Felzenszwalb, D. McAllester, and D. Ramanan. A Discriminatively Trained, Multiscale, Deformable Part Model. CVPR, (2000 citations) W. Ouyang and X. Wang. A Discriminative Deep Model for Pedestrian Detection with Occlusion Handling. CVPR,
69 Our Joint Deep Learning Model W. Ouyang and X. Wang, Joint Deep Learning for Pedestrian Detection, Proc. ICCV,
70 Modeling Part Detectors Design the filters in the second convolutional layer with variable sizes Part models learned from HOG Part models Learned filters at the second convolutional layer 70
71 Deformation Layer 71
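To make the idea concrete, here is a sketch of what a deformation layer computes in the spirit of this model: a part detection map is penalized by weighted deformation maps and then globally max-pooled into a single part score and location. The quadratic penalty maps, anchor, and weights below are illustrative assumptions, not the exact design in the paper:

```python
import numpy as np

def deformation_layer(part_map, anchor, c):
    """part_map: (H, W) detection scores for one part from the conv layer.
    anchor: (ay, ax) preferred part location; c: weights of the deformation
    penalty terms. Returns the part score after deformation-aware max pooling."""
    H, W = part_map.shape
    ys, xs = np.mgrid[0:H, 0:W]
    dy, dx = ys - anchor[0], xs - anchor[1]
    D = np.stack([dx, dy, dx ** 2, dy ** 2])      # deformation maps D_n
    penalty = np.tensordot(c, D, axes=1)          # sum_n c_n * D_n
    B = part_map - penalty                        # penalized score map
    idx = np.unravel_index(np.argmax(B), B.shape)
    return B[idx], idx                            # part score and detected position

# Toy usage with a random score map and hypothetical penalty weights
rng = np.random.default_rng(0)
score, pos = deformation_layer(rng.random((20, 10)), anchor=(10, 5),
                               c=np.array([0.0, 0.0, 0.05, 0.05]))
```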
72 Visibility Reasoning with Deep Belief Net Correlates with part detection score 72
73 Experimental Results Caltech Test dataset (largest, most widely used), evaluated by average miss rate (%) 73
74 Experimental Results Caltech Test dataset: 95% 74
75 Experimental Results Caltech Test dataset: 95%, 68% 75
76 Experimental Results Caltech Test dataset: 95%, 68%, 63% (state-of-the-art) 76
77 Experimental Results Caltech Test dataset: 95%, 68%, 63% (state-of-the-art), 53%, and the best performing result (ours), improving by ~20%. W. Ouyang and X. Wang, "A Discriminative Deep Model for Pedestrian Detection with Occlusion Handling", CVPR. W. Ouyang, X. Zeng and X. Wang, "Modeling Mutual Visibility Relationship in Pedestrian Detection", CVPR. W. Ouyang, Xiaogang Wang, "Single-Pedestrian Detection aided by Multi-pedestrian Detection", CVPR. X. Zeng, W. Ouyang and X. Wang, "A Cascaded Deep Learning Architecture for Pedestrian Detection", ICCV. W. Ouyang and Xiaogang Wang, "Joint Deep Learning for Pedestrian Detection", IEEE ICCV. 77
78 DN-HOG UDN-HOG UDN-HOGCSS UDN-CNNFeat UDN-DefLayer 78
79 Large learning capacity makes high dimensional data transforms possible, and makes better use of contextual information 79
80 How to make use of the large learning capacity of deep models? High dimensional data transform Hierarchical nonlinear representations SVM + feature smoothness, shape prior Output? High-dimensional data transform Input 80
81 Face Parsing P. Luo, X. Wang and X. Tang, Hierarchical Face Parsing via Deep Learning, CVPR
82 Motivations Recast face segmentation as a cross-modality data transformation problem Cross modality autoencoder Data of two different modalities share the same representations in the deep model Deep models can be used to learn shape priors for segmentation 82
83 Training Segmentators 83
84 84
85 Big data Rich information Challenging supervision task with rich predictions How to make use of it? Capacity Hierarchical feature learning Reduce capacity Capture contextual information Go deeper Joint optimization Domain knowledge Take large input Make learning more efficient Go wider 85
86 Summary Automatically learns hierarchical feature representations from data and disentangles hidden factors of input data through multi-level nonlinear mappings For some tasks, the expressive power of deep models increases exponentially as their architectures go deep Jointly optimize all the components in a vision system and create synergy through close interactions among them Benefitting from the large learning capacity of deep models, we also recast some classical computer vision challenges as high-dimensional data transform problems and solve them from new perspectives It is more effective to train deep models with challenging tasks and rich predictions 86
87 References D. E. Rumelhart, G. E. Hinton, R. J. Williams, Learning Representations by Backpropagation Errors, Nature, Vol. 323, pp , N. Kruger, P. Janssen, S. Kalkan, M. Lappe, A. Leonardis, J. Piater, A. J. Rodriguez- Sanchez, L. Wiskott, Deep Hierarchies in the Primate Visual Cortex: What Can We Learn For Computer Vision? IEEE Trans. PAMI, Vol. 35, pp , A. Krizhevsky, L. Sutskever, and G. E. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, Proc. NIPS, Y. Sun, X. Wang, and X. Tang, Deep Learning Face Representation by Joint Identification-Verification, NIPS, K. Fukushima, Neocognitron: A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position, Biological Cybernetics, Vol. 36, pp , Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based Learning Applied to Document Recognition, Proceedings of the IEEE, Vol. 86, pp , G. E. Hinton, S. Osindero, and Y. Teh, A Fast Learning Algorithm for Deep Belief Nets, Neural Computation, Vol. 18, pp ,
88 G. E. Hinton and R. R. Salakhutdinov, Reducing the Dimensionality of Data with Neural Networks, Science, Vol. 313, pp , July Z. Zhu, P. Luo, X. Wang, and X. Tang, Deep Learning Identity Preserving Face Space, Proc. ICCV, Z. Zhu, P. Luo, X. Wang, and X. Tang, Deep Learning and Disentangling Face Representation by Multi-View Perception, NIPS Y. Sun, X. Wang, and X. Tang, Deep Learning Face Representation from Predicting 10,000 classes, Proc. CVPR, J. Hastad, Almost Optimal Lower Bounds for Small Depth Circuits, Proc. ACM Symposium on Theory of Computing, J. Hastad and M. Goldmann, On the Power of Small-Depth Threshold Circuits, Computational Complexity, Vol. 1, pp , A. Yao, Separating the Polynomial-time Hierarchy by Oracles, Proc. IEEE Symposium on Foundations of Computer Science, Sermanet, K. Kavukcuoglu, S. Chintala, and LeCun, Pedestrian Detection with Unsupervised Multi-Stage Feature Learning, CVPR W. Ouyang and X. Wang, Joint Deep Learning for Pedestrian Detection, Proc. ICCV, P. Luo, X. Wang and X. Tang, Hierarchical Face Parsing via Deep Learning, Proc. CVPR, Honglak Lee, Tutorial on Deep Learning and Applications, NIPS
89 Outline Introduction to deep learning Deep learning for object recognition Deep learning for object segmentation Deep learning for object detection Open questions and future works 89
90 Part II: Deep Learning Object Recognition Deep learning for object recognition on ImageNet Deep learning for face recognition Learn identity features from joint verificationidentification signals Learn 3D face models from 2D images 90
91 CNN for Object Recognition on ImageNet Krizhevsky, Sutskever, and Hinton, NIPS 2012 Trained on one million images of 1000 categories collected from the web with two GPUs; 2GB RAM on each GPU; 5GB of system memory Training lasts for one week ImageNet 2012 results: Rank 1 U. Toronto (deep learning); Rank 2 U. Tokyo, Rank 3 U. Oxford, Rank 4 Xerox/INRIA (hand-crafted features and learning models; bottleneck) 91
92 Model Architecture Max-pooling layers follow the 1st, 2nd, and 5th convolutional layers The number of neurons in each layer is given by ..., ..., 64,896, 43,264, 4,096, 4,096, ... neurons; 60 million parameters, 630 million connections 92
93 Normalization Normalize the input by subtracting the mean image on the training set Input image (256 x 256) Mean image 93 Krizhevsky 2012
94 Activation Function Rectified linear unit leads to sparse responses of neurons, such that weights can be effectively updated with BP Sigmoid (slow to train) Rectified linear unit (quick to train) 94 Krizhevsky 2012
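The two activation functions contrasted on this slide in a few lines of NumPy: ReLU outputs exact zeros for negative inputs (sparse responses) and has a constant unit gradient for positive inputs, whereas the sigmoid gradient is at most 0.25, which slows weight updates with BP:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)               # rectified linear unit

z = np.linspace(-5.0, 5.0, 11)
print(relu(z))                              # negative inputs give exact zeros
print(sigmoid(z) * (1.0 - sigmoid(z)))      # sigmoid gradient, never above 0.25
print((z > 0).astype(float))                # ReLU gradient: 0 or 1
```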
95 Data Augmentation The neural net has 60M parameters and it overfits Image regions are randomly cropped with shift; their horizontal reflections are also included 95 Krizhevsky 2012
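A sketch of the augmentation described above: randomly crop a shifted sub-window and add its horizontal reflection. Cropping 224 x 224 patches from a 256 x 256 image follows the common AlexNet setting and is an assumption here:

```python
import numpy as np

def augment(image, crop=224, rng=None):
    """Return a randomly shifted crop and its horizontal reflection."""
    if rng is None:
        rng = np.random.default_rng()
    H, W, _ = image.shape
    top = rng.integers(0, H - crop + 1)      # random vertical shift
    left = rng.integers(0, W - crop + 1)     # random horizontal shift
    patch = image[top:top + crop, left:left + crop]
    flipped = patch[:, ::-1]                 # horizontal reflection
    return patch, flipped

image = np.zeros((256, 256, 3), dtype=np.uint8)   # placeholder image
patch, flipped = augment(image)
```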
96 Dropout Randomly set some input features and the outputs of hidden units to zero during the training process Feature co-adaptation: a feature is only helpful when other specific features are present Because of noise and data corruption, some features or the responses of hidden nodes can be misdetected Dropout prevents feature co-adaptation and can significantly improve the generalization of the trained network Can be considered as another approach to regularization It can be viewed as averaging over many neural networks Slower convergence 96
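A minimal dropout sketch in NumPy. This is the "inverted" variant, which rescales the surviving activations during training so nothing needs to change at test time; the original formulation instead scaled the outputs at test time, but the idea is the same:

```python
import numpy as np

def dropout(activations, p_drop=0.5, training=True, rng=None):
    """Randomly zero activations during training; identity at test time."""
    if not training:
        return activations
    if rng is None:
        rng = np.random.default_rng()
    mask = (rng.random(activations.shape) >= p_drop).astype(activations.dtype)
    return activations * mask / (1.0 - p_drop)   # keep the expected value unchanged

h = np.ones(10)
print(dropout(h))    # roughly half of the hidden responses become zero
```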
97 Classification Result 97 Krizhevsky 2012
98 Detection Result 98 Krizhevsky 2012
99 Image Retrieval 99 Krizhevsky 2012
100 Adaptation to Smaller Datasets Directly use the feature representations learned from ImageNet and replace handcrafted features with them in image classification, scene recognition, fine-grained object recognition, attribute recognition, image retrieval (Razavian et al. 2014, Gong et al. 2014) Use ImageNet to pre-train the model (good initialization), and use the target dataset to fine-tune it (Girshick et al. CVPR 2014) Fix the bottom layers and only fine-tune the top layers, as sketched below 100
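A sketch of the pre-train / fine-tune recipe above written with PyTorch (a framework that postdates this tutorial): the bottom, pre-trained layers are frozen and only the new top layers are updated on the smaller target dataset. The architecture, class count, and hyper-parameters are stand-ins:

```python
import torch
import torch.nn as nn

pretrained_bottom = nn.Sequential(              # stand-in for conv layers pre-trained on ImageNet
    nn.Conv2d(3, 64, 7, stride=2), nn.ReLU(),
    nn.Conv2d(64, 128, 5, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
top_layers = nn.Linear(128, 20)                 # new classifier for a hypothetical 20-class target task

for p in pretrained_bottom.parameters():
    p.requires_grad = False                     # fix the bottom layers

model = nn.Sequential(pretrained_bottom, top_layers)
optimizer = torch.optim.SGD(top_layers.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

images = torch.randn(8, 3, 224, 224)            # toy batch from the target dataset
labels = torch.randint(0, 20, (8,))
loss = criterion(model(images), labels)         # fine-tune only the top layers
loss.backward()
optimizer.step()
```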
101 GoogLeNet More than 20 layers Add supervision at multiple layers The error rate is reduced from 15.3% to 6.6% 101
102 Deep Learning Object Recognition Deep learning for object recognition on ImageNet Deep learning for face recognition Learn identity features from joint verificationidentification signals Learn 3D face models from 2D images 102
103 Deep Learning Results on LFW Method / Accuracy (%) / # points / # training images: Huang et al. CVPR 12: 87%, 3 points, unsupervised; Sun et al. ICCV 13: 92.52%, 5 points, 87,628; DeepFace (CVPR 14): 97.35%, 73 points, 7,000,000; Sun et al. (CVPR 14): 97.45%, 5 points, 202,599; Sun et al. (NIPS 14): 99.15%, 202,599; New: DeepID2+ (arxiv 14): 99.47%. The first deep learning work on face recognition was done by Huang et al. at CVPR 12; with unsupervised learning, the accuracy was 87% Our work at ICCV 13 achieved a result (92.52%) comparable with the state of the art Our work at CVPR 14 reached 97.45%, close to human cropped performance (97.53%) DeepFace, developed by Facebook also at CVPR 14, used 73-point 3D face alignment and 7 million training images (35 times larger than ours) Our most recent work reached 99.15%, close to human funneled performance (99.20%) Y. Sun, X. Wang, and X. Tang. Deeply learned face representations are sparse, selective, and robust. arxiv: ,
104 Closed- and open-set face identification on LFW Method Rank-1 (%) 1% FAR (%) COST-S1 [1] COST-S1+s2 [1] DeepFace [2] DeepFace+ [3] DeepID2 [4] DeepID2+ [5] [1] L. Best-Rowden, H. Han, C. Otto, B. Klare, and A. K. Jain. Unconstrained face recognition: Identifying a person of interest from a media collection. TR MSU-CSE-14-1, [2] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. DeepFace: Closing the gap to human-level performance in face verification. In Proc. CVPR, [3] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Web-scale training for face identification. Technical report, arxiv: , [4] Y. Sun, X. Wang, and X. Tang. Deep Learning Face Representation by Joint Identification-Verification. NIPS, [5] Y. Sun, X. Wang, and X. Tang. Deeply learned face representations are sparse, selective, and robust. arxiv: ,
105 Eternal Topic on Face Recognition Intra-personal variation Inter-personal variation How to separate the two types of variations? 105
106 Are they the same person or not? Nicole Kidman Nicole Kidman 106
107 Are they the same person or not? Coo d Este Melina Kanakaredes 107
108 Are they the same person or not? Elijah Wood Stefano Gabbana 108
109 Are they the same person or not? Jim O Brien Jim O Brien 109
110 Are they the same person or not? Jacquline Obradors Julie Taymor 110
111 Out of 6000 image pairs on the LFW test set, 51 pairs are misclassified with the deep model We randomly mixed them and presented them to 10 Chinese subjects for evaluation. Their averaged verification accuracy is 56%, close to random guess (50%) 111
112 Linear Discriminant Analysis 112
113 LDA seeks a linear feature mapping which maximizes the distance between class centers under the constraint that the intra-personal variation is constant 113
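A compact NumPy sketch of this LDA objective on toy two-class data: the mapping w maximizes between-class scatter relative to within-class (intra-personal) scatter, obtained here from the leading eigenvector of Sw^{-1} Sb:

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))   # samples of class 1
X2 = rng.normal(loc=[2, 1], scale=0.5, size=(50, 2))   # samples of class 2

m1, m2 = X1.mean(0), X2.mean(0)
m = np.vstack([X1, X2]).mean(0)

Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)           # within-class scatter
Sb = 50 * (np.outer(m1 - m, m1 - m) + np.outer(m2 - m, m2 - m))  # between-class scatter

eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
w = eigvecs[:, np.argmax(eigvals.real)].real            # best linear discriminant direction
projections = np.vstack([X1, X2]) @ w                   # 1-D features after the mapping
```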
114 Deep Learning for Face Recognition Extract identity preserving features through hierarchical nonlinear mappings Model complex intra- and inter-personal variations with large learning capacity 114
115 Learn Identity Features from Different Supervisory Tasks Face identification: classify an image into one of N identity classes multi-class classification problem Face verification: verify whether a pair of images belong to the same identity or not binary classification problem 115
116 Minimize the intra-personal variation under the constraint that the distance between classes is constant (i.e. contracting the volume of the image space without reducing the distance between classes) (1, 0, 0) (0, 1, 0) (0, 0, 1) 116
117 Learn Identity Features with Verification Signal Extract relational features with learned filter pairs These relational features are further processed through multiple layers to extract global features The fully connected layer can be used as features to combine with multiple ConvNets 117 Y. Sun, X. Wang, and X. Tang, Hybrid Deep Learning for Computing Face Similarities, Proc. ICCV, 2013.
118 Results on LFW Unrestricted protocol without outside training data 118
119 Results on LFW Unrestricted protocol using outside training data 119
120 DeepID: Learn Identity Features with Identification Signal (1, 0, 0) (0, 1, 0) (0, 0, 1) 120 Y. Sun, X. Wang, and X. Tang, Deep Learning Face Representation from Predicting 10,000 classes, Proc. CVPR, 2014.
121 During training, each image is classified into 10,000 identities with 160 identity features in the top layer These features keep rich inter-personal variations Features from the last two convolutional layers are effective The hidden identity features can be well generalized to other tasks (e.g. verification) and identities outside the training set 121
122 High-dimensional prediction is more challenging, but also adds stronger supervision to the network As the number of classes to be predicted increases, the generalization power of the learned features also improves 122
123 Extract Features from Multiple ConvNets 123
124 Learn Identity Features with Identification Signal After combining hidden identity features from multiple ConvNets and further reducing dimensionality with PCA, each face image has 150-dimensional features as its signature These features can be further processed by other classifiers in face verification. Interestingly, we find Joint Bayesian is more effective than cascading another neural network to classify these features 124
125 DeepID2: Joint Identification-Verification Signals Every two feature vectors extracted from the same identity should be close to each other f_i and f_j are feature vectors extracted from two face images in comparison y_ij = 1 means they are from the same identity; y_ij = -1 means different identities m is a margin to be learned Y. Sun, X. Wang, and X. Tang. Deep Learning Face Representation by Joint Identification-Verification. NIPS, 2014. 125
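Based on the description above, the verification signal can be written as a contrastive loss with margin m, added to the softmax identification loss with a weighting parameter (discussed on the next slide). The sketch below uses toy feature dimensions, identity counts, and weights:

```python
import numpy as np

def verification_loss(f_i, f_j, y_ij, m):
    """Contrastive verification signal: pull same-identity features together,
    push different identities at least a margin m apart."""
    dist = np.linalg.norm(f_i - f_j)
    if y_ij == 1:                                # same identity
        return 0.5 * dist ** 2
    return 0.5 * max(0.0, m - dist) ** 2         # different identities

def identification_loss(logits, target):
    """Softmax cross-entropy over identity classes (identification signal)."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return -np.log(p[target])

# Toy joint objective with weighting parameter lam (lambda):
rng = np.random.default_rng(0)
f_i, f_j = rng.normal(size=160), rng.normal(size=160)   # 160-d identity features
logits, target = rng.normal(size=1000), 3               # toy identity classifier output
lam = 0.05
loss = identification_loss(logits, target) + lam * verification_loss(f_i, f_j, -1, m=1.0)
```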
126 Balancing Identification and Verification Signals with Parameter λ λ = 0: only identification signal λ = +∞: only verification signal Figure 3: Face verification accuracy by varying the weighting parameter λ (plotted in log scale) 126
127 Rich Identity Information Improves Feature Learning Face verification accuracies with the number of training identities 127
128 Summary of DeepID2 25 face regions at different scales and locations around landmarks are selected to build 25 neural networks All the 160 X 25 hidden identity features are further compressed into a 180-dimensional feature vector with PCA as a signature for each image With a single Titan GPU, the feature extraction process takes 35ms per image 128
129 DeepID2+ Larger network structures Larger training data Adding supervisory signals at every layer Y. Sun, X. Wang, and X. Tang. Deeply learned face representations are sparse, selective, and robust. arxiv: , 2014. 129
130 Compare DeepID2 and DeepID2+ on LFW Comparison of face verification accuracies on LFW with ConvNets trained on 25 face regions given in DeepID2 Best single model is improved from 96.72% to 98.70% 130
131 Final Result on LFW Methods High-dim LBP [1] TL Joint Bayesian [2] DeepFace [3] DeepID [4] DeepID2 [5] DeepID2+ [6] Accuracy (%) [1] Chen, Cao, Wen, and Sun. Blessing of dimensionality: High-dimensional feature and its efficient compression for face verification. CVPR, [2] Cao, Wipf, Wen, Duan, and Sun. A practical transfer learning algorithm for face verification. ICCV, [3] Taigman, Yang, Ranzato, and Wolf. DeepFace: Closing the gap to human-level performance in face verification. CVPR, [4] Sun, Wang, and Tang. Deep learning face representation from predicting 10,000 classes. CVPR, [5] Y. Sun, Y. Chen, X. Wang, and X. Tang. Deep Learning Face Representation by Joint Identification-Verification. NIPS, [6] Y. Sun, X. Wang, and X. Tang. Deeply learned face representations are sparse, selective, and robust. arxiv: ,
132 Closed- and open-set face identification on LFW Method Rank-1 (%) 1% FAR (%) COST-S1 [1] COST-S1+s2 [1] DeepFace [2] DeepFace+ [3] DeepID DeepID [1] L. Best-Rowden, H. Han, C. Otto, B. Klare, and A. K. Jain. Unconstrained face recognition: Identifying a person of interest from a media collection. TR MSU-CSE-14-1, [2] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. DeepFace: Closing the gap to human-level performance in face verification. In Proc. CVPR, [3] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Web-scale training for face identification. Technical report, arxiv: , 132
133 Face Verification on YouTube Faces Methods Accuracy (%) LM3L [1] 81.3 ± 1.2 DDML (LBP) [2] 81.3 ± 1.6 DDML (combined) [2] 82.3 ± 1.5 EigenPEP [3] 84.8 ± 1.4 DeepFace [4] 91.4 ± 1.1 DeepID ± 0.2 [1] J. Hu, J. Lu, J. Yuan, and Y. P. Tan, Large margin multi-metric learning for face and kinship verification in the wild, ACCV 2014 [2] J. Hu, J. Lu, and Y. P. Tan, Discriminative deep metric learning for face verification in the wild, CVPR 2014 [3] H. Li, G. Hua, X. Shen, Z. Lin, and J. Brandt, Eigen-pep for video face recognition, ACCV 2014 [4] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, DeepFace: Closing the gap to human-level performance in face verification, CVPR 133
134 GoogLeNet Inter-class variation High dimensional image space Sigmoid Rectified linear unit Linear transform Pooling Nonlinear mapping 134
135 Unified subspace analysis Joint deep learning Identification signal is in S b ; verification signal is in S w Learn features by joint identification-verification Maximize distance between classes under constraint that intrapersonal variation is constant Minimize intra-personal variation under constraint that the distance between classes is constant Linear feature mapping Hierarchical nonlinear feature extraction Generalization power increases with more training identities 135
136 What has been learned by DeepID2+? Properties owned by its neurons: Moderately sparse Selective to identities and attributes Robust to data corruption These properties are naturally owned by DeepID2+ through large-scale training, without explicitly adding regularization terms to the model 136
137 Biological Motivation Monkey has a face-processing network that is made of six interconnected face-selective regions Neurons in some of these regions were view-specific, while some others were tuned to identity across views View could be generalized to other factors, e.g. expressions? Winrich A. Freiwald and Doris Y. Tsao, Functional compartmentalization and viewpoint generalization within the macaque face-processing system, Science, 330(6005): ,
138 Deeply learned features are moderately sparse For an input image, about half of the neurons are activated A neuron is activated on about half of the images 138
139 Deeply learned features are moderately sparse The binary codes on activation patterns of neurons are very effective for face recognition Activation patterns are more important than activation magnitudes in face recognition Joint Bayesian (%) Hamming distance (%) Single model (real values) Single model (binary code) Combined model (real values) Combined model (binary code) n/a n/a 139
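A sketch of the binary-code comparison described here: threshold the activations at zero so that only the activation pattern is kept, then compare two faces by Hamming distance. The feature dimension and the zero threshold are assumptions:

```python
import numpy as np

def binarize(features, threshold=0.0):
    return (features > threshold).astype(np.uint8)     # 1 = neuron activated

def hamming_distance(code_a, code_b):
    return int(np.count_nonzero(code_a != code_b))

rng = np.random.default_rng(0)
f_a = np.maximum(0.0, rng.normal(size=512))   # toy ReLU-like feature vectors
f_b = np.maximum(0.0, rng.normal(size=512))
dist = hamming_distance(binarize(f_a), binarize(f_b))
# A small distance suggests the same identity; thresholding it gives verification.
```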
140 Deeply learned features are selective to identities and attributes With a single neuron, DeepID2 reaches 97% recognition accuracy for some identity and attribute 140
141 Deeply learned features are selective to identities and attributes With a single neuron, DeepID2 reaches 97% recognition accuracy for some identity and attribute Identity classification accuracy on LFW with one single DeepID2+ or LBP feature. GB, CP, TB, DR, and GS are five celebrities with the most images in LFW. Attribute classification accuracy on LFW with one single DeepID2+ or LBP feature. 141
142 Deeply learned features are selective to identities and attributes Excitatory and inhibitory neurons Histograms of neural activations over identities with the most images in LFW 142
143 Deeply learned features are selective to identities and attributes Excitatory and inhibitory neurons Histograms of neural activations over gender-related attributes (Male and Female) Histograms of neural activations over race-related attributes (White, Black, Asian and India) 143
144 Histogram of neural activations over age-related attributes (Baby, Child, Youth, Middle Aged, and Senior) Histogram of neural activations over hair-related attributes (Bald, Black Hair, Gray Hair, Blond Hair, and Brown Hair) 144
145 DeepID2+ High-dim LBP 145
146 DeepID2+ High-dim LBP 146
147 Deeply learned features are selective to identities and attributes Visualize the semantic meaning of each neuron 147
148 Deeply learned features are selective to identities and attributes Visualize the semantic meaning of each neuron Neurons are ranked by their responses in descending order with respect to test images 148
149 DeepID2 features for attribute recognition Features at top layers are more effective on recognizing identity-related attributes Features at lower layers are more effective on identity-non-related attributes 149
150 DeepID2 features for attribute recognition DeepID2 features can be directly used for attribute recognition Use DeepID2 features as initialization (pre-trained result), and then fine-tune on attribute recognition Average accuracy on 40 attributes on CelebA and LFWA datasets CelebA LFWA FaceTracer [1] (HOG+SVM) PANDA-W [2] (Parts are automatically detected) PANDA-L [2] (Parts are given by ground truth) DeepID Fine-tune (w/o DeepID2) DeepID2 + fine-tune Z. Liu, P. Luo, X. Wang, and X. Tang, Deep Learning Face Attributes in the Wild, arxiv: , 2014. 150
151 Deeply learned features are robust to occlusions Global features are more robust to occlusions 151
152 152
153 Outline Deep learning for object recognition on ImageNet Deep learning for face recognition Learn identity features from joint verificationidentification signals Learn 3D face models from 2D images 153
154 Deep Learning Multi-view Representation from 2D Images Inspired by brain behaviors [Winrich et al. Science 2010] Identity and view represented by different sets of neurons Given an image under an arbitrary view, its viewpoint can be estimated and its full spectrum of views can be reconstructed Z. Zhu, P. Luo, X. Wang, and X. Tang, Deep Learning and Disentangling Face Representation by Multi-View Perception, NIPS 2014. 154
155 Deep Learning Multi-view Representation from 2D Images x and y are input and output images of the same identity but in different views; v is the view label of the output image; h_id are neurons encoding identity features; h_v are neurons encoding view features; h_r are neurons encoding features to reconstruct the output images 155
156 Face recognition accuracies across views and illuminations on the Multi-PIE dataset. The first and the second best performances are in bold. 156
157 Deep Learning Multi-view Representation from 2D Images Interpolate and predict images under viewpoints unobserved in the training set The training set only has viewpoints of 0 o, 30 o, and 60 o. (a): the reconstructed images under 15 o and 45 o when the input is taken under 0 o. (b) The input images are under 15 o and 45 o. 157
158 Outline Introduction to deep learning Deep learning for object recognition Deep learning for object segmentation Deep learning for object detection Open questions and future works 158
159 Whole-image classification vs pixelwise classification Whole-image classification: predict a single label for the whole image Pixelwise classification: predict a label at every pixel Segmentation, detection, and tracking CNN, forward and backward propagation were originally proposed for whole-image classification This difference was ignored when CNNs were applied to pixelwise classification problems, so they encountered efficiency problems 159
160 Pixelwise Classification Image patches centered at each pixel are used as the input of a CNN, and the CNN predicts a class label for each pixel A lot of redundant computation because of overlap between patches Image patches around each pixel location Class label for each pixel Farabet et al. TPAMI 2013 Pinheiro and Collobert ICML
161 Classify Segmentation Proposals: determine which segmentation proposal can best represent objects of interest Segmentation Proposals → CNN → Classifier Person Bottle Bottle Chair R. Girshick, J. Donahue, T. Darrell, and J. Malik, Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, CVPR 2014 161
162 Direct Predict Segmentation Maps P. Luo, X. Wang, and X. Tang, Pedestrian Parsing via Deep Decompositional Network, ICCV
163 Direct Predict Segmentation Maps The classifier is location sensitive and has no translation invariance The prediction depends not only on the neighborhood of the pixel but also on its location Only suitable for images with regular structures, such as faces and humans 163
164 Efficient Forward-Propagation of Convolutional Neural Networks Generate the same result as patch-by-patch scanning, with 1500 times speedup for both forward and backward propagation H. Li, R. Zhao, and X. Wang, Highly Efficient Forward and Backward Propagation of Convolutional Neural Networks for Pixelwise Classification, arxiv: , 2014 164
165 Speedup = O((s·m / (s + m))²), where s² is the image size and m² is the patch size. The layerwise timing and speedup results of the forward and backward propagation by our proposed algorithm on the RCNN model with 3×410×410 images as inputs. 165
166 Fully convolutional neural network Replace fully connected layers in CNN with 1 x 1 convolution kernel just like network in network (Lin, Chen and Yan, arxiv 2013) Take the whole images as inputs and directly output segmentation map Has translation invariance like patch-by-patch scanning, but with much lower computational cost Once FCNN is learned, it can process input images of any sizes without warping them to a standard size 166 K. Kang and X. Wang, Fully Convolutional Neural Networks for Crowd Segmentation, arxiv: , 2014
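The key point behind the fully convolutional design is that a fully connected layer applied at every spatial location is exactly a 1 x 1 convolution, so the network can take inputs of any size and output a dense score map. A small NumPy sketch (the sizes and the two-class crowd/background setting are assumptions):

```python
import numpy as np

def fc_as_1x1_conv(feature_map, W, b):
    """feature_map: (H, W, C) conv features; W: (C, K) fully connected weights.
    Applying the FC layer at every location equals a 1 x 1 convolution."""
    H, Wd, C = feature_map.shape
    scores = feature_map.reshape(-1, C) @ W + b   # per-pixel linear classifier
    return scores.reshape(H, Wd, -1)              # (H, W, K) segmentation score map

rng = np.random.default_rng(0)
features = rng.random((30, 40, 64))               # toy conv feature maps
W, b = rng.normal(size=(64, 2)), np.zeros(2)      # 2 classes, e.g. crowd / background
score_map = fc_as_1x1_conv(features, W, b)        # works for inputs of any size
```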
167 Fully convolutional neural network Convolution-pooling layers Fully connected layers Fusion convolutional layers implemented by 1 x 1 kernels 167
168 Summary Deep learning significantly outperforms conventional vision systems on large scale image classification Feature representation learned from ImageNet can be well generalized to other tasks and datasets In face recognition, identity preserving features can be effectively learned by joint identification-verification signals 3D face models can be learned from 2D images; identity and pose information is encoded by different sets of neurons In segmentation, larger patches lead to better performance because of the large learning capacity of deep models. It is also possible to directly predict the segmentation map. The efficiency of CNN based segmentation can be significantly improved by considering the differences between whole-image classification and pixelwise classification 168
169 References A. Krizhevsky, L. Sutskever, and G. E. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, Proc. NIPS, G. B. Huang, H. Lee, and E. Learned-Miller, Learning Hierarchical Representation for Face Verification with Convolutional Deep Belief Networks, Proc. CVPR, Y. Sun, X. Wang, and X. Tang, Hybrid Deep Learning for Computing Face Similarities, Proc. ICCV, Y. Sun, X. Wang, and X. Tang, Deep Learning Face Representation from Predicting 10,000 classes, Proc. CVPR, Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, DeepFace: Closing the Gap to Human-Level Performance in Face Verification, Proc. CVPR, Y. Sun, X. Wang, and X. Tang, Deep Learning Face Representation by Joint Identification-Verification, NIPS, A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, CNN Features off-the-shelf: an Astounding Baseline for Recognition, arxiv preprint arxiv: , Y. Gong, L. Wang, R. Guo, and S. Lazebnik, Multi-Scale Orderless Pooling of Deep Convolutional Activation Features, arxiv preprint arxiv: ,
170 M. Turk and A. Pentland, Eigenfaces for Recognition, Journal of Cognitive Neuroscience, Vol. 3, pp , P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection, TPAMI, Vol. 19, pp , B. Moghaddam, T. Jebara, and A. Pentland, Bayesian Face Recognition, Pattern Recognition, Vol. 33, pp , X. Wang and X. Tang, A Unified Framework for Subspace Face Recognition, TPAMI, Vol. 26, pp , Z. Zhu, P. Luo, X. Wang, and X. Tang, Deep Learning and Disentangling Face Representation by Multi-View Perception, NIPS C. Farabet, C. Couprie, L. Najman, and Y. LeCun, Learning Hierarchical Features for Scene Labeling, TPAMI, Vol. 35, pp , P. O. Pinheiro and R. Collobert, Recurrent Convolutional Neural Networks for Scene Labeling, Proc. ICML R. Girshick, J. Donahue, T. Darrell, and J. Malik, Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation CVPR 2014 P. Luo, X. Wang, and X. Tang, Pedestrian Parsing via Deep Decompositional Network, ICCV Winrich A. Freiwald and Doris Y. Tsao, Functional compartmentalization and viewpoint generalization within the macaque face-processing system, Science, 330(6005): , Shay Ohayon, Winrich A. Freiwald, and Doris Y. Tsao. What makes a cell face selective? the importance of contrast. Neuron, 74: ,
171 Y. Sun, X. Wang, and X. Tang. Deeply learned face representations are sparse, selective, and robust. arxiv: , Z. Liu, P. Luo, X. Wang, and X. Tang, Deep Learning Face Attributes in the Wild, arxiv: , H. Li, R. Zhao, and X. Wang, Highly Efficient Forward and Backward Propagation of Convolutional Neural Networks for Pixelwise Classification, arxiv: , K. Kang and X. Wang, Fully Convolutional Neural Networks for Crowd Segmentation, arxiv: ,
172 Outline Introduction to deep learning Deep learning for object recognition Deep learning for object segmentation Deep learning for object detection Open questions and future works 172
173 Part IV: Deep Learning for Object Detection Pedestrian Detection Human part localization General object detection Object detection Deep learning Face alignment Pedestrian detection Human pose estimation 173
174 Part IV: Deep Learning for Object Detection Jointly optimize the detection pipeline Multi-stage deep learning (cascaded detectors) Mixture components Integrate segmentation and detection to depress background clutters Contextual modeling Pre-training Model deformation of object parts, which are shared across classes 174
175 Joint Deep Learning: Jointly optimize the detection pipeline 175
176 What if we treat an existing deep model as a black box in pedestrian detection? ConvNet-U-MS Sermanet, K. Kavukcuoglu, S. Chintala, and LeCun, Pedestrian Detection with Unsupervised Multi-Stage Feature Learning, CVPR
177 Results on ETHZ Results on Caltech Test 177
178 N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. CVPR, (6000 citations) P. Felzenszwalb, D. McAllester, and D. Ramanan. A Discriminatively Trained, Multiscale, Deformable Part Model. CVPR, (2000 citations) W. Ouyang and X. Wang. A Discriminative Deep Model for Pedestrian Detection with Occlusion Handling. CVPR,
179 Our Joint Deep Learning Model Feature extraction Deformation handling W. Ouyang and X. Wang, Joint Deep Learning for Pedestrian Detection, Proc. ICCV,
180 Modeling Part Detectors Design the filters in the second convolutional layer with variable sizes Part models learned from HOG Part models Learned filters at the second convolutional layer 180
181 Our Joint Deep Learning Model Deformation handling 181
182 Deformation Layer 182
183 Visibility Reasoning with Deep Belief Net 183
184 Results on Caltech Test Results on ETHZ 184
185 DN-HOG UDN-HOG UDN-HOGCSS UDN-CNNFeat UDN-DefLayer 185
186 Multi-Stage Contextual Deep Learning: Train different detectors for different types of samples Model contextual information Stage-by-stage pretraining strategies 186 X. Zeng, W. Ouyang and X. Wang, "Multi-Stage Contextual Deep Learning for Pedestrian Detection," ICCV 2013
187 Motivated by Cascaded Classifiers and Contextual Boost The classifier of each stage deals with a specific set of samples The score map output by one classifier can serve as contextual information for the next classifier Only pass one detection score to the next stage Classifiers are trained sequentially Conventional cascaded classifiers for detection 187
188 Simulate the cascaded classifiers by mining hard samples to train the network stage-by-stage Cascaded classifiers are jointly optimized instead of being trained sequentially The deep model keeps the score map output by the current classifier and it serves as contextual information to support the decision at the next stage To avoid overfitting, a stage-wise pre-training scheme is proposed to regularize optimization 188
189 Training Strategies Unsupervised pre-train W_{h,i+1} layer-by-layer, setting W_{s,i+1} = 0, F_{i+1} = 0 Fine-tune all the W_{h,i+1} with supervised BP Train F_{i+1} and W_{s,i+1} with BP stage-by-stage A correctly classified sample at the previous stage does not influence the update of parameters Stage-by-stage training can be considered as adding regularization constraints to the parameters, i.e. some parameters are constrained to be zero in the early training stages Log error function: Gradients for updating parameters: 189
190 Experimental Results Caltech ETHZ 190
191 DeepNetNoneFilter 191
192 Comparison of Different Training Strategies Network-BP: use back propagation to update all the parameters without pre-training PretrainTransferMatrix-BP: the transfer matrices are unsupervised pretrained, and then all the parameters are fine-tuned Multi-stage: our multi-stage training strategy 192
193 Switchable Deep Network Use mixture components to model complex variations of body parts Use salience maps to suppress background clutter Help detection with segmentation information P. Luo, Y. Tian, X. Wang, and X. Tang, "Switchable Deep Network for Pedestrian Detection", CVPR
194 Switchable Deep Network for Pedestrian Detection Background clutter and large variations of pedestrian appearance. Proposed Solution: a Switchable Deep Network (SDN) for learning the foreground map and removing the effect of background clutter. 194
195 Switchable Deep Network for Pedestrian Detection Switchable Restricted Boltzmann Machine 195
196 Switchable Deep Network for Pedestrian Detection Switchable Restricted Boltzmann Machine Background Foreground 196
197 Switchable Deep Network for Pedestrian Detection (a) Performance on Caltech Test (b) Performance on ETH 197
198 Human Part Localization Contextual information is important to segmentation as well as detection 198
199 Human part localization Facial Keypoint Detection Human pose estimation Sun et al. CVPR 13 Ouyang et al. CVPR
200 Facial Keypoint Detection Y. Sun, X. Wang and X. Tang, Deep Convolutional Network Cascade for Facial Point Detection, CVPR
201 201
202 Comparison with Belhumeur et al. [4], Cao et al. [5] on LFPW test images O. Jesorsky, K. J. Kirchberg, and R. Frischholz. Robust face detection using the hausdorff distance. In Proc. AVBPA, P. N. Belhumeur, D. W. Jacobs, D. J. Kriegman, and N. Kumar. Localizing parts of faces using a consensus of exemplars. In Proc. CVPR, X. Cao, Y. Wei, F. Wen, and J. Sun. Face alignment by explicit shape regression. In Proc. CVPR, L. Liang, R. Xiao, F. Wen, and J. Sun. Face alignment via component-based discriminative search. In Proc. ECCV, M. Valstar, B. Martinez, X. Binefa, and M. Pantic. Facial point detection using boosted regression and graph models. In Proc. CVPR, 2010.
203 Validation. BioID. LFPW. 203
204 Benefits of Using Deep Model The first network that takes the whole face as input needs deep structures to extract high-level features Take the full face as input to make full use of texture context information over the entire face to locate each keypoint Since the networks are trained to predict all the keypoints simultaneously, the geometric constraints among keypoints are implicitly encoded 204
205 Human pose estimation W. Ouyang, X. Chu and X. Wang, Multi-source Deep Learning for Human Pose Estimation CVPR
206 Multiple information sources Appearance 206
207 Multiple information sources Appearance Appearance mixture type 207
208 Multiple information sources Appearance Appearance mixture type Deformation 208
209 Multi-source deep model 209
210 Experimental results PARSE Method Torso U.leg L.leg U.arm L.arm head Total Yang&Ramanan CVPR Multi-source deep learning UIUC People Method Torso U.leg L.leg U.arm L.arm head Total Yang&Ramanan CVPR Multi-source deep learning LSP Method Torso U.leg L.leg U.arm L.arm head Total Yang&Ramanan CVPR Multi-source deep learning Up to 8.6 percent accuracy improvement with global geometric constraints 210
211 Experimental results Left: mixture-of-parts (Yang&Ramanan CVPR 11) Right: Multi-source deep learning 211
212 General Object Detection Pretraining Model deformation of object parts, which are shared across classes Contextual modeling 212
213 Object detection Pascal VOC ~ 20 object classes Training: ~ 5,700 images Testing: ~10,000 images Image-net ILSVRC ~ 200 object classes Training: ~ 395,000 images Testing: ~ 40,000 images 213
214 SIFT, HOG, LBP, DPM [Regionlets. Wang et al. ICCV 13] [SegDPM. Fidler et al. CVPR 13] 214
215 With CNN features R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, CVPR,
216 R-CNN: regions + CNN features Input image Extract region proposals (~2k/image) Compute CNN features Region: 91.6%/98% recall rate on ImageNet/PASCAL Selective Search [van de Sande, Uijlings et al. IJCV 2013]. 2-class linear SVM Deep model from Krizhevsky, Sutskever & Hinton. NIPS 2012 SVM: Liblinear 216
217 RCNN: deep model training Pretrain for the 1000-way ILSVRC image classification task (1.2 million images) Fine-tune the CNN for detection Transfer the representation learned from ILSVRC Classification to PASCAL (or ImageNet) detection ImageNetCls Pre-Train Det Fine-tune Network from Krizhevsky, Sutskever & Hinton. NIPS 2012 Also called AlexNet 217
218 Experimental results on ILSVRC
219 Experimental results on ILSVRC 2014 (mean average precision, model average and single model): GoogLeNet (Google), DeepID-Net (CUHK), DeepInsight, UvA-Euvision, Berkeley Vision, RCNN; entries without results are marked n/a
220 DeepID-Net: deformable deep convolutional neural networks for generic object detection W. Ouyang, et al. DeepID-Net: multi-stage and deformable deep convolutional neural networks for object detection, 220 arxiv: , 2014.
221 RCNN pipeline
Image → selective search (proposed bounding boxes) → AlexNet + SVM (detection results, e.g. person, horse) → bounding box regression (refined bounding boxes)
221
222 RCNN vs. DeepID-Net (mean AP improved from 31.4; new result)
RCNN: image → selective search (proposed bounding boxes) → AlexNet + SVM (detection results) → bounding box regression (refined bounding boxes)
DeepID-Net: image → selective search (proposed bounding boxes) → box rejection → DeepID-Net (pretraining, def-pooling layer, sub-box features, hinge loss) → context modeling → bounding box regression → model averaging (detection results)
222
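Bounding box regression, the refinement step in both pipelines, can be illustrated with the offset parameterisation commonly used in R-CNN-style detectors; the numbers below are made up, and the regressor that predicts the offsets is assumed to have been trained already.

import numpy as np

def apply_box_deltas(boxes, deltas):
    """Refine (x1, y1, x2, y2) boxes with predicted (dx, dy, dw, dh) offsets,
    using the parameterisation common in R-CNN-style detectors."""
    w = boxes[:, 2] - boxes[:, 0]
    h = boxes[:, 3] - boxes[:, 1]
    cx = boxes[:, 0] + 0.5 * w
    cy = boxes[:, 1] + 0.5 * h
    new_cx = cx + deltas[:, 0] * w          # shift the centre proportionally to box size
    new_cy = cy + deltas[:, 1] * h
    new_w = w * np.exp(deltas[:, 2])        # rescale width/height in log space
    new_h = h * np.exp(deltas[:, 3])
    return np.stack([new_cx - 0.5 * new_w, new_cy - 0.5 * new_h,
                     new_cx + 0.5 * new_w, new_cy + 0.5 * new_h], axis=1)

boxes = np.array([[10., 10., 50., 90.]])
deltas = np.array([[0.05, -0.02, 0.10, 0.00]])   # would come from a learned regressor
print(apply_box_deltas(boxes, deltas))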
223 (DeepID-Net pipeline overview, repeated; next: the DeepID-Net model.) 223
224 DeepID-Net 224
225 (DeepID-Net pipeline overview, repeated; next: pretraining.) 225
226 Deep model training: pretraining
RCNN (Cls+Det): pretrain on image-level annotation with 1000 classes, then fine-tune on object-level annotation with 200 classes.
Gap: classification vs. detection, 1000 vs. 200 classes (image classification vs. object detection).
226
227 Investigation, results, and discussion
Better pretraining on 1000 classes: object-level annotation is more suitable for pretraining than image-level annotation.
Conclusions:
The supervisory tasks should match at the pre-training and fine-tuning stages.
Although an application only involves detecting a small number of classes, it is better to pretrain with many classes outside the application.
(Image-level annotation vs. object-level annotation; 200 classes (Det) vs. 1000 classes (Cls-Loc).)
227
228 (DeepID-Net pipeline overview, repeated; next: the def-pooling layer.) 228
229 Deformation
Learning deformation [a] has been effective in the computer vision community, but it is missing in deep models. We propose a new deformation-constrained pooling layer.
[a] P. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Trans. PAMI, 32.
230 Deformation layer [b]
[b] Wanli Ouyang and Xiaogang Wang, "Joint Deep Learning for Pedestrian Detection", ICCV 2013.
231 Modeling Part Detectors
Different parts have different sizes; design the filters with variable sizes.
Part models learned from HOG vs. part models learned as filters at the second convolutional layer.
231
232 Deformation layer for repeated patterns
Pedestrian detection: assumes no repeated pattern.
General object detection: repeated patterns.
232
233 Deformation layer for repeated patterns
Pedestrian detection: assumes no repeated pattern; considers only one object class.
General object detection: repeated patterns; patterns shared across different object classes.
233
234 Deformation-constrained pooling layer: can capture multiple patterns simultaneously. 234
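A simplified sketch of the underlying idea follows: for every part filter, take the maximum over a local window of the filter response minus a learned quadratic penalty on the spatial offset, so each pattern may shift at a cost. This is only an illustration of the principle in PyTorch, with an assumed window size and penalty form, not the exact def-pooling layer of DeepID-Net.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DefPooling2d(nn.Module):
    """Simplified deformation-constrained pooling (illustrative only).
    For each channel (part detector), output the max over a local window of
    (response - learned quadratic deformation penalty on the offset)."""
    def __init__(self, channels, window=5):
        super().__init__()
        self.window = window
        # one penalty weight per channel for squared horizontal/vertical offsets
        self.wx = nn.Parameter(torch.ones(channels))
        self.wy = nn.Parameter(torch.ones(channels))

    def forward(self, resp):
        # resp: (N, C, H, W) part-filter response maps
        n, c, h, w = resp.shape
        k, pad = self.window, self.window // 2
        # gather local windows: (N, C*k*k, H*W) -> (N, C, k*k, H*W)
        patches = F.unfold(resp, kernel_size=k, padding=pad).view(n, c, k * k, h * w)
        # squared offsets of each window position from the window centre
        dy, dx = torch.meshgrid(torch.arange(k) - pad, torch.arange(k) - pad, indexing="ij")
        dy2 = (dy.float() ** 2).reshape(1, 1, k * k, 1).to(resp.device)
        dx2 = (dx.float() ** 2).reshape(1, 1, k * k, 1).to(resp.device)
        penalty = self.wy.view(1, c, 1, 1) * dy2 + self.wx.view(1, c, 1, 1) * dx2
        out, _ = (patches - penalty).max(dim=2)   # best (response - deformation cost)
        return out.view(n, c, h, w)

# usage
layer = DefPooling2d(channels=8)
scores = layer(torch.randn(2, 8, 16, 16))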
235 Our deep model with deformation layer (patterns shared across different classes)
Mean AP on val is compared for three settings:
training scheme Cls+Det with net structure AlexNet;
training scheme Loc+Det with net structure Clarifai;
training scheme Loc+Det with net structure Clarifai + def-pooling layer.
236 (DeepID-Net pipeline overview, repeated; next: context modeling.) 236
237 Context modeling: use the 1000-class image classification scores; ~1% mAP improvement. 237
238 Context modeling
Use the 1000-class image classification scores; ~1% mAP improvement overall.
Volleyball: AP improves by 8.4% on val2 (examples: volleyball, golf ball, bathing cap).
238
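One simple way to realise this kind of context, sketched below, is to append the image-level 1000-way classification scores to each box's detection score and learn a linear re-scoring model on top; the toy data and the linear combiner are assumptions for illustration, not the exact scheme used in DeepID-Net.

import numpy as np

rng = np.random.default_rng(0)

# toy stand-ins: detection scores for M candidate boxes of one class in one image,
# plus the image-level 1000-way classification scores (the "context")
det_scores = rng.normal(size=(5, 1))      # (M, 1)
cls_scores = rng.normal(size=(1000,))     # (1000,)

# contextual feature for each box: its own score concatenated with the image-level scores
features = np.hstack([det_scores, np.tile(cls_scores, (det_scores.shape[0], 1))])  # (M, 1001)

# a linear re-scoring model (weights would be learned, e.g. with an SVM, on validation data)
w = rng.normal(scale=0.01, size=(1001,))
b = 0.0
rescored = features @ w + b
print(rescored)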
239 (DeepID-Net pipeline overview, repeated; next: model averaging.) 239
240 Model averaging
Not only change parameters:
Net structure: AlexNet (A), Clarifai (C), DeepID-Net (D), DeepID-Net2 (D2)
Pretrain: classification (C), localization (L)
Region rejection or not
Loss of net: softmax (S), hinge loss (H)
Choose different sets of models for different object classes.

Model           1    2    3    4    5    6    7    8    9    10
Net structure   A    A    C    C    D    D    D2   D    D    D
Pretrain        C    C+L  C    C+L  C+L  C+L  L    L    L    L
Reject region?  Y    N    Y    Y    Y    Y    Y    Y    Y    Y
Loss of net     S    S    S    H    H    H    H    H    H    H
Mean AP is reported per model.
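Class-wise model averaging can be sketched as below: for every object class, average the candidate-box scores of a chosen subset of the trained models. The subsets would be selected on a validation set; here a random toy choice stands in, so this illustrates the idea rather than the DeepID-Net selection procedure.

import numpy as np

rng = np.random.default_rng(1)

num_models, num_classes, num_boxes = 10, 200, 50
# scores[m, c, b]: score of model m for class c on candidate box b (toy data)
scores = rng.normal(size=(num_models, num_classes, num_boxes))

# per-class model subsets, e.g. chosen greedily on a validation set (random toy choice here)
chosen = {c: rng.choice(num_models, size=3, replace=False) for c in range(num_classes)}

# class-wise averaging over the selected models
averaged = np.stack([scores[chosen[c], c, :].mean(axis=0) for c in range(num_classes)])
print(averaged.shape)  # (200, 50)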
241 (DeepID-Net pipeline overview, repeated; next: component analysis.) 241
242 Component analysis
mAP on val and on test is reported as components are added to the detection pipeline: RCNN baseline → box rejection → Clarifai net → Loc+Det pretraining → def-pooling layer → context modeling → bounding box regression → model averaging (full DeepID-Net).
242
243 Summary
Bounding box rejection: saves feature extraction time by about 10 times and slightly improves mAP (~1%).
Pre-training with object-level annotation and more classes: 4.2% mAP improvement.
Def-pooling layer: 2.5% mAP improvement.
Contextual modeling: 1% mAP improvement.
Model averaging: 2.3% mAP improvement; different model designs and training schemes lead to high diversity.
243
244 Reference
R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation", CVPR.
P. Sermanet, et al., "Overfeat: Integrated Recognition, Localization and Detection Using Convolutional Networks", arXiv preprint, 2013.
W. Ouyang and X. Wang, "Joint Deep Learning for Pedestrian Detection", ICCV 2013.
X. Zeng, W. Ouyang, and X. Wang, "Multi-Stage Contextual Deep Learning for Pedestrian Detection", ICCV 2013.
P. Luo, Y. Tian, X. Wang, and X. Tang, "Switchable Deep Network for Pedestrian Detection", CVPR 2014.
W. Ouyang, X. Zeng, and X. Wang, "Modeling Mutual Visibility Relationship with a Deep Model in Pedestrian Detection", CVPR 2013.
W. Ouyang and X. Wang, "A Discriminative Deep Model for Pedestrian Detection with Occlusion Handling", CVPR 2012.
Y. Sun, X. Wang, and X. Tang, "Deep Convolutional Network Cascade for Facial Point Detection", CVPR 2013.
W. Ouyang, X. Chu, and X. Wang, "Multi-source Deep Learning for Human Pose Estimation", CVPR 2014.
Y. Yang and D. Ramanan, "Articulated Pose Estimation with Flexible Mixtures-of-Parts", CVPR.
244
245 Reference
X. Wang, M. Yang, S. Zhu, and Y. Lin, "Regionlets for Generic Object Detection", ICCV.
S. Fidler, R. Mottaghi, A. Yuille, and R. Urtasun, "Bottom-up Segmentation for Top-Down Detection", CVPR.
A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks", Proc. NIPS.
J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. M. Smeulders, "Selective Search for Object Recognition", IJCV.
W. Ouyang, et al., "DeepID-Net: multi-stage and deformable deep convolutional neural networks for object detection", arXiv preprint, 2014.
245
246 Outline Introduction to deep learning Deep learning for object recognition Deep learning for object segmentation Deep learning for object detection Open questions and future works 246
247 Concerns on deep learning
C1: Weak theoretical support (convergence, bounds, local minima, why it works).
It's true. That's why deep learning papers were not accepted by the computer vision/image processing community for a long time. Any theoretical studies in the future are important.
247
248 Most computer vision/multimedia papers: motivations; new objective function; new optimization algorithm; theoretical analysis; experimental results.
Deep learning papers for computer vision/multimedia: motivations; new network structure and new objective function; back propagation (standard); super experimental results.
That is probably one of the reasons that computer vision and image processing people think deep learning papers lack novelty and theoretical contribution.
248
249 Concerns on deep learning
C2: It is hard for computer vision/image processing people to make innovative contributions to deep learning. Our job becomes preparing the data and using deep learning as a black box. That is the end of our research life.
That is not true. Computer vision and image processing researchers have developed many systems with deep architectures; we just did not know how to jointly learn all the components. Our research experience and insights can help to design new deep models and pretraining strategies.
Many machine learning models and algorithms were motivated by computer vision and image processing applications. However, computer vision and multimedia did not have close interaction with neural networks in the past 15 years. We expect fast development of deep learning driven by applications.
249
250 Concerns on deep learning
C3: Since the goal of neural networks is to solve the general learning problem, why do we need domain knowledge?
The most successful deep model for image- and video-related applications is the convolutional neural network, which itself uses domain knowledge (filtering, pooling).
Domain knowledge is especially important when the training data is not large enough.
250
251 Concerns on deep learning
C4: Good results achieved by deep learning come from manually tuning network structures and learning rates, and trying different initializations.
That is not true. One round of evaluation may take several weeks, so there is no time to test all possible settings. Designing and training deep models does require a lot of empirical experience and insight. There are also many tricks and much guidance provided by deep learning researchers; most of them make sense intuitively but lack strict proof.
251
252 Concerns on deep learning
C5: Deep learning is more suitable for industry than for research groups in universities.
Industry has big data and computation resources. Research groups at universities can contribute to model design, training algorithms, and new applications.
252
253 Concerns on deep learning
C6: Deep learning behaves differently when the scale of the training data is different.
Pre-training is useful when the training data is small, but does not make a big difference when the training data is large enough.
So far, the performance of deep learning keeps increasing with the size of the training data; we have not seen its limit yet. Shall we spend more effort on data annotation or on model design?
253
254 Future works
Explore deep learning in new applications: worth trying if the application requires features or learning and has enough training data.
We once had many doubts about deep learning (Does it work for vision? Does it work for segmentation? Does it work for low-level vision?), but it has given us a lot of surprises.
Applications will inspire many new deep models.
Incorporate domain knowledge into deep learning.
Integrate existing machine learning models with deep learning.
254
255 Future works
Deep learning to extract dynamic features for video analysis.
Deep models for structured data.
Theoretical studies on deep learning: quantitative analysis of how to design network structures and how to choose the nonlinear operations of different layers in order to achieve feature invariance.
New optimization and training algorithms.
Parallel computing systems and algorithms to train very large and deep networks with larger training data.
255
257 Thank you!