Willow project-team Learning and transferring mid-level image representations using convolutional neural networks Maxime Oquab, Léon Bottou, Ivan Laptev, Josef Sivic 1
Image classification (easy) Is there a car? Source : Pascal VOC dataset 2
Image classification (harder) Is there a boat? Source : Pascal VOC dataset 3
Image classification (v.hard) Is there a person? Source : Pascal VOC dataset 5
Pascal VOC vs. ImageNet classification Pascal VOC : complex scenes 20 object classes 10k images ImageNet : object-centric 1000 object classes 1.2M images 7
Image classification Traditional methods: HOG, SIFT, FV, SVMs, DPM, k-means, GMM... [Csurka et al.'04], [Lowe'04], [Sivic & Zisserman'03], [Perronnin et al.'10], [Lazebnik et al.'06], [Zhang et al.'07], [Boureau et al.'10], [Singh et al.'12], [Juneja et al.'13], [Chatfield et al.'11], [van Gemert et al.'08], [Wang et al.'10], [Zhou et al.'10], [Dong et al.'13], [Fei-Fei et al.'05], [Shotton et al.'05], [Moosmann et al.'05], [Grauman & Darrell'05], [Harzallah et al.'09], [...] Convolutional neural networks ImageNet challenge [Krizhevsky et al. 2012] 8
Brief history of CNNs Rosenblatt, 1957 : The perceptron : a perceiving and recognizing automaton. Hubel & Wiesel, 1959 : Receptive fields of single neurons in the cat's striate cortex. Fukushima, 1980 : Neocognitron. Rumelhart et al., 1986 : Learning representations by back-propagating errors. LeCun et al., 1989 : Backpropagation applied to handwritten zip code recognition. LeCun et al., 1998 : Efficient BackProp. LeCun et al., 1998 : Gradient-based learning applied to document recognition. Hinton & Salakhutdinov, 2006 : Reducing the dimensionality of data with neural networks. Krizhevsky et al., 2012 : ImageNet classification with deep convolutional neural networks. Zeiler & Fergus, 2013 : Visualizing and understanding convolutional networks. Sermanet et al., 2013 : OverFeat. Donahue et al., 2013 : DeCAF. Girshick et al., 2014 : Rich feature hierarchies for accurate object detection and semantic segmentation. Razavian et al., 2014 : CNN features off-the-shelf : an astounding baseline for recognition. Chatfield et al., 2014 : Return of the devil in the details. 9
Neural Networks layers X0 X1 X2 Cost Input w1 w2 weights (parameters) Differentiable operations : weights trained by gradient descent. 10
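The pipeline on this slide (input X0 through weighted layers to a cost, with the weights trained by gradient descent) can be sketched as a toy NumPy network. The dimensions, the ReLU nonlinearity, and the squared-error cost are illustrative assumptions, not the architecture from the talk.

```python
# Toy two-layer network: differentiable operations whose weights
# (w1, w2) are trained by gradient descent, as in the slide.
# All shapes, the ReLU, and the cost are illustrative choices.
import numpy as np

def forward(x, w1, w2):
    """x (X0) -> hidden X1 = relu(x w1) -> output X2 = X1 w2."""
    h = np.maximum(0.0, x @ w1)
    return h, h @ w2

def train_step(x, y, w1, w2, lr):
    """One gradient-descent update of the squared-error cost."""
    h, out = forward(x, w1, w2)
    err = out - y                 # dCost/dX2 (up to a factor of 2)
    grad_w2 = h.T @ err
    grad_h = err @ w2.T
    grad_h[h <= 0] = 0.0          # backprop through the ReLU
    grad_w1 = x.T @ grad_h
    return w1 - lr * grad_w1, w2 - lr * grad_w2

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 4))
y = x @ rng.normal(size=(4, 1))   # easy linear target
w1 = rng.normal(scale=0.5, size=(4, 8))
w2 = rng.normal(scale=0.5, size=(8, 1))
losses = []
for _ in range(200):
    _, out = forward(x, w1, w2)
    losses.append(float(((out - y) ** 2).mean()))
    w1, w2 = train_step(x, y, w1, w2, lr=0.001)
```

Every operation is differentiable, so the chain rule gives the weight gradients end to end; the cost decreases over the 200 updates.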
8-layer NN [Krizhevsky et al.] 60 million parameters : - ImageNet (1.2M images) : OK - Pascal VOC (10k images) :? 11
Pascal VOC : different task Car examples from Pascal VOC Typical car examples from ImageNet 12
Solution : multi-scale patch tiling Goal : obtain a dataset that looks like ImageNet. Small-scale tiling Large-scale tiling Typical Pascal VOC car example... in disguise Typical car examples from ImageNet 14
Solution : multi-scale patch tiling Around 500 tiles per image. Multiple scales and positions. Label depending on overlap. background car car 15
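A minimal sketch of the tiling just described: square patches slide over the image at several scales and positions, and each patch is labeled by its overlap with the object's bounding box. The scales, the stride, and the 0.2 overlap threshold are hypothetical values for illustration, not the ones used in the paper.

```python
# Multi-scale patch tiling with overlap-based labels (sketch).
# Scales, stride fraction, and the 0.2 threshold are assumptions.
import itertools

def tiles(img_w, img_h, scales=(96, 160, 224), stride_frac=0.5):
    """Yield (x, y, size) square patches at multiple scales/positions."""
    for s in scales:
        step = max(1, int(s * stride_frac))
        for x, y in itertools.product(range(0, img_w - s + 1, step),
                                      range(0, img_h - s + 1, step)):
            yield (x, y, s)

def overlap(patch, box):
    """Intersection area / patch area, in [0, 1]."""
    px, py, s = patch
    bx0, by0, bx1, by1 = box
    ix = max(0, min(px + s, bx1) - max(px, bx0))
    iy = max(0, min(py + s, by1) - max(py, by0))
    return ix * iy / float(s * s)

def label(patch, box, thresh=0.2):
    return "car" if overlap(patch, box) >= thresh else "background"

patches = list(tiles(500, 375))              # a Pascal-sized image
car_box = (100, 120, 300, 260)               # toy bounding box
labels = [label(p, car_box) for p in patches]
```

With a denser stride and more scales the count approaches the ~500 tiles per image quoted on the slide.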
First attempt Train CNN on Pascal VOC patches : Result : 70.9% mAP. We observe overfitting. State of the art : 82.2% mAP (NUS-PSL). How to benefit from the power of neural networks? We propose transfer learning. 16
Transfer learning ImageNet Source task Source task labels African elephant Layers L1-L7 L8 Wall clock Green snake ImageNet network Yorkshire terrier
Transfer learning ImageNet Source task Source task labels African elephant Layers L1-L7 L8 Wall clock Green snake Yorkshire terrier Pascal VOC Chair Background Layers L1-L7 La Lb Person TV/monitor Sliding patches Target task Target task labels 18
Transfer learning ImageNet Source task Source task labels African elephant Layers L1-L7 L8 Wall clock Green snake Pascal VOC Transfer parameters Yorkshire terrier Chair Background Layers L1-L7 La Lb Person TV/monitor Sliding patches Target task Target task labels 20
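The transfer in the diagram can be sketched as follows: the pre-trained parameters of layers L1-L7 are copied, the ImageNet-specific output layer L8 is discarded, and freshly initialised adaptation layers La and Lb, sized for the 20 Pascal VOC classes plus background, are appended. Layer shapes here are toy placeholders, not the real network's.

```python
# Parameter transfer sketch: keep L1-L7, drop L8, add new La/Lb.
# Networks are represented as dicts of weight arrays; all shapes
# are toy placeholders, not the actual AlexNet dimensions.
import numpy as np

rng = np.random.default_rng(0)

# Pretend "source" network pre-trained on ImageNet (1000 classes).
source = {f"L{i}": rng.normal(size=(16, 16)) for i in range(1, 8)}
source["L8"] = rng.normal(size=(16, 1000))   # ImageNet classifier head

def transfer(source_net, n_target_classes=21, rng=rng):
    """Copy L1-L7, discard L8, append random adaptation layers."""
    target = {k: v.copy() for k, v in source_net.items() if k != "L8"}
    # New adaptation layers, trained from scratch on the target task.
    target["La"] = rng.normal(scale=0.01, size=(16, 16))
    target["Lb"] = rng.normal(scale=0.01, size=(16, n_target_classes))
    return target

target = transfer(source)
```

Only La and Lb need to be learnt on the small Pascal VOC dataset; the transferred layers bring the mid-level representation learnt on 1.2M ImageNet images.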
Second attempt (with pre-training) After pre-training on the ILSVRC-2012 dataset, we obtain 78.7% mean AP (no pre-train : 70.9%). Significantly better, but can we improve more? We observe large boosts (+18%, +14%) for the dog and bird classes : well-represented groups in ILSVRC-2012. 21
Pre-training data Inspect 22k classes of the ImageNet tree: «furniture» subtree contains chairs, dining tables, sofas; «hoofed mammal» subtree contains sheep, horses, cows... Add 512 classes to the pre-training. Result improves from 78.7% to 82.8% mAP. All scores increase; targeted classes improve more. 22
Computing scores at test time We extract ~500 multi-scale patches. Image score = sum of all patch scores. Pixel score = sum of overlapping patch scores (heat maps). CNN person classifier 23
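The aggregation just described can be sketched directly: the image score sums all patch scores, and a heat map accumulates, at each pixel, the scores of every patch covering it. Patches use a hypothetical (x, y, size) convention and toy scores.

```python
# Test-time aggregation sketch: image score = sum of patch scores,
# heat map = per-pixel sum of overlapping patch scores.
import numpy as np

def aggregate(patches, scores, img_w, img_h):
    """Return (image_score, heat_map) from per-patch classifier scores."""
    heat = np.zeros((img_h, img_w))
    for (x, y, s), score in zip(patches, scores):
        heat[y:y + s, x:x + s] += score   # overlapping patches accumulate
    return float(sum(scores)), heat

# Two overlapping 4x4 patches on a tiny 8x8 image.
patches = [(0, 0, 4), (2, 2, 4)]
scores = [1.0, 2.0]
image_score, heat = aggregate(patches, scores, img_w=8, img_h=8)
```

Pixels covered by both patches receive the sum of both scores, which is what produces the heat maps shown on the following slides.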
Qualitative results Chair Dining table Potted plant Sofa Person TV monitor Source : Pascal VOC 12 test set 24
Visualizations (aeroplane) First false positive Source : Pascal VOC 12 test set 28
Visualizations (bicycle) First false positive Source : Pascal VOC 12 test set 29
Visualizations (sheep) First false positive Source : Pascal VOC 12 test set 31
Quantitative results Pascal VOC 12 object classification : State of the art : No pre-training baseline : 1000 ILSVRC classes : Random 1000 classes : 1512 classes (our best) : 37
Different task : action classification (still images) playing instrument, jumping, running... Source : Pascal VOC 12 Action classification test set. State-of-the-art result : 70.2% mAP 38
Qualitative results (reading) 40
Qualitative results (playing instrument) 41
Qualitative results (phoning) 42
Take-home messages Transfer learning with CNNs avoids overfitting. See also : [Girshick et al. 14], [Sermanet et al. 13], [Donahue et al. 13], [Zeiler & Fergus 13], [Razavian et al. 14], [Chatfield et al. 14]. We study the effect of pre-training data : more pre-training data => better; related pre-training data => even better. Transfer to action classification. http://www.di.ens.fr/willow/research/cnn/ Implementation (Torch7 modules) available soon. Includes efficient and flexible GPU training code. 43
This work training bounding boxes «dog» heatmap Bounding box annotation is expensive. Can we avoid it? YES WE CAN! 44
Follow-up work image-level labels only, «dog» heatmap. Weakly supervised, no bounding boxes required. 82.8 => 86.3% mean AP on VOC classification. Appearing on arXiv soon (check our webpage). http://www.di.ens.fr/willow/research/weakcnn/ 45
Willow project-team Weakly supervised object recognition with convolutional neural networks Maxime Oquab, Léon Bottou, Ivan Laptev, Josef Sivic (All following slides stolen from Josef Sivic) 146
Are bounding boxes needed for training CNNs? Image-level labels: Bicycle, Person [Oquab, Bottou, Laptev, Sivic, In submission, 2014] 47
Motivation: labeling bounding boxes is tedious 48
Motivation: image-level labels are plentiful Beautiful red leaves in a back street of Freiburg [Kuznetsova et al., ACL 2013] http://www.cs.stonybrook.edu/~pkuznetsova/imgcaption/captions1k.html 49
Let the algorithm localize the object in the image. Example training images with bounding boxes : typical, cluttered, cropped. CNN score maps : the locations of objects learnt by the CNN. Figure 1: Top row: example images for the motorbike class and corresponding ground truth from the Pascal VOC12 training set. Bottom row: corresponding per-pixel score maps produced by a weakly supervised motorbike classifier that has been trained from image class labels only. NB: Related to multiple instance learning, e.g. [Viola et al. 05], and weakly supervised object localization, e.g. [Pandey and Lazebnik 11], [Prest et al. 12]. In this paper, we focus on algorithms that recover classes and locations of objects. [Oquab, Bottou, Laptev, Sivic, In submission, 2014] 50
Approach: search over the object's location. Architecture : conv layers C1-C5, then FC6 and FC7 (9216-dim → 4096-dim → 4096-dim vectors), adaptation layers FCa and FCb, and a max-pool over the image giving a per-image score (motorbike, person, diningtable, pottedplant, chair, car, bus, train). 1. Efficient window sliding to find object location hypotheses. 2. Image-level aggregation (max-pool). 3. Multi-label loss function (allow multiple objects in image). See also [Sermanet et al. 14] and [Chatfield et al. 14] 51
Approach: search over the object's location. Note : all FC layers are now large convolutions. 52
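Steps 2 and 3 above can be sketched in NumPy: each class's score map is max-pooled over all spatial positions into a per-image score, and an independent per-class logistic loss allows several objects per image. Map sizes, scores, and the exact loss form are illustrative assumptions, not the paper's implementation.

```python
# Image-level max-pool aggregation + multi-label loss (sketch).
# Toy 4x4 score maps for 3 classes; the logistic loss form is
# an illustrative choice for a multi-label objective.
import numpy as np

def image_scores(score_maps):
    """Max-pool each class score map over all spatial positions."""
    return score_maps.reshape(score_maps.shape[0], -1).max(axis=1)

def multilabel_loss(scores, labels):
    """Sum of independent per-class logistic losses, labels in {0, 1}."""
    y = 2.0 * labels - 1.0                    # map to {-1, +1}
    return float(np.log1p(np.exp(-y * scores)).sum())

maps = np.zeros((3, 4, 4))                    # 3 classes, 4x4 maps
maps[0, 1, 2] = 5.0                           # strong response, class 0
maps[2, 3, 3] = -1.0
s = image_scores(maps)
loss = multilabel_loss(s, np.array([1.0, 0.0, 0.0]))
```

Because classes are scored independently, an image containing both a person and a motorbike can have both labels positive at once, unlike a softmax over classes.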
Max-pooling search. Search for objects using max-pooling : «Found something there!» : the receptive field of the maximum-scoring neuron (aeroplane map, car map, max-pool). Correct label : increase the score for this class («Keep up the good work!»). Incorrect label : decrease the score for this class («Wrong!»). 54
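The training signal on this slide can be sketched as the gradient of the max-pooled score: it is nonzero only at the maximum-scoring position, pushed up for a correct label («keep up the good work») and down for an incorrect one («wrong»). The toy map below is illustrative.

```python
# Gradient of +/- max(score_map): only the argmax location is touched.
import numpy as np

def max_pool_grad(score_map, positive):
    """Gradient of max(score_map) (negated if the label is wrong)."""
    grad = np.zeros_like(score_map)
    idx = np.unravel_index(np.argmax(score_map), score_map.shape)
    grad[idx] = 1.0 if positive else -1.0
    return grad

m = np.array([[0.1, 2.0],
              [0.5, -1.0]])                  # toy score map
g_pos = max_pool_grad(m, positive=True)      # correct label
g_neg = max_pool_grad(m, positive=False)     # incorrect label
```

This is why training localizes objects without boxes: only the location the network itself proposes receives a learning signal.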
Search for objects using max-pooling. What is the effect of errors? 55
Multi-scale training and testing. Rescale by factors in [0.7, 1.4]. Convolutional adaptation layers. Figure 3: Weakly supervised training. Figure 4: Multiscale object recognition (per-class scores for chair, diningtable, person, pottedplant, sofa, car, bus, train). The network architecture of [26] assumes a fixed-size image. 56
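A small sketch of the rescaling scheme above: scale factors span [0.7, 1.4], and each factor produces a different input size for the fully convolutional network. The number of scales and the geometric spacing are illustrative assumptions.

```python
# Multi-scale input sizes from scale factors in [0.7, 1.4] (sketch).
# Five geometrically spaced scales is an illustrative choice.
import numpy as np

def scale_factors(n=5, lo=0.7, hi=1.4):
    """n geometrically spaced scale factors in [lo, hi]."""
    return np.geomspace(lo, hi, n)

def rescaled_sizes(h, w, factors):
    """Input size (h, w) after rescaling by each factor."""
    return [(int(round(h * f)), int(round(w * f))) for f in factors]

sizes = rescaled_sizes(224, 224, scale_factors())
```

Because the FC layers have been turned into convolutions, the network accepts each of these sizes and simply outputs a larger or smaller score map.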
Evolution of maps during training 57
Results (VOC2012 test set, average precision %):

Method                      mAP   plane bike  bird  boat  btl   bus   car   cat   chair cow   table dog   horse moto  pers  plant sheep sofa  train tv
A. Zeiler and Fergus [40]   79.0  96.0  77.1  88.4  85.5  55.8  85.8  78.6  91.2  65.0  74.4  67.7  87.8  86.0  85.1  90.9  52.2  83.6  61.1  91.8  76.1
B. Oquab et al. [26]        82.8  94.6  82.9  88.2  84.1  60.3  89.0  84.4  90.7  72.1  86.8  69.0  92.1  93.4  88.6  96.1  64.3  86.6  62.3  91.1  79.8
C. Chatfield et al. [4]     83.2  96.8  82.5  91.5  88.1  62.1  88.3  81.9  94.8  70.3  80.2  76.2  92.9  90.3  89.3  95.2  57.4  83.6  66.4  93.5  81.9
D. Full images (our)        78.7  95.3  77.4  85.6  83.1  49.9  86.7  77.7  87.2  67.1  79.4  73.5  85.3  90.3  85.6  92.7  47.8  81.5  63.4  91.4  74.1
E. Strong+weak (our)        86.0  96.5  88.3  91.9  87.7  64.0  90.3  86.8  93.7  74.0  89.8  76.3  93.4  94.9  91.2  97.3  66.0  90.9  69.9  93.9  83.2
F. Weak supervision (our)   86.3  96.7  88.8  92.0  87.4  64.7  91.1  87.4  94.4  74.9  89.2  76.3  93.7  95.2  91.1  97.6  66.2  91.2  70.0  94.5  83.7

Table 1: Per-class results for object classification on the VOC2012 test set (average precision %). Our weakly supervised setup outperforms the state of the art on all but three object classes.

- Localizing objects by sliding helps.
- Full supervision does not improve over weak supervision.
- New state of the art (F. weak supervision in Table 1) on Pascal VOC 2012 object classification.

Benefits of object localization during training. First we compare the proposed weakly supervised method (F. weak supervision in Table 1) with training from full images (D. full images), where no search for object location during training/testing is performed and images are presented to the network at a single scale. Otherwise the network architectures are identical. The results clearly demonstrate the benefits of sliding-window multi-scale training attempting to localize the objects in the training data. Note that this type of full-image training is similar to the set-up used in Zeiler and Fergus [40] (A.), Chatfield et al. [4] (C.) or Razavian et al. [31] (not shown in the table). 58
Object localization examples in testing data : (a) representative true positives, (b) top-ranking false positives, for aeroplane, bicycle, bird, boat, bottle, bus. Figure 5: Output probability maps on representative images of several categories from the Pascal VOC 2012 test set. 59
Are bounding boxes harmful? Motivation (CVPR 14). Output of the fully supervised CVPR 14 network : why a higher score on the dog's head? Is the network trying to say something? Responses are inconsistent with the annotations. Maybe we are doing it wrong. 60
Are bounding boxes harmful? Bounding boxes are NOT alignment. They should be treated as guidance, not supervision (at least for object classification). Training images : typical, cluttered, cropped, with CNN score maps. 61