Module 5 Deep Convnets for Local Recognition Joost van de Weijer 4 April 2016
Previously, end-to-end.. an image (e.g. a dog) is mapped through a learned representation to a label for Task A (e.g. image classification). Part I: End-to-end learning (E2E). Slide credit: Jose M
Previously, fine-tuning.. a representation learned end-to-end on Domain A is transferred and fine-tuned on Domain B. Part I: End-to-End Fine-Tuning (FT). Slide credit: X. Giro
Previously, fine-tuning.. Fine-tuning a pre-trained network: use a high learning rate in the new layer and a low learning rate in all other layers (a code sketch follows below). Slide credit: Victor Campos, Layer-wise CNN surgery for Visual Sentiment Prediction (ETSETB 2015)
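A hedged illustration of this recipe as a minimal PyTorch sketch; torchvision's AlexNet, the 20-class new layer, and the specific learning rates are illustrative assumptions, not taken from the slides:

```python
import torch
from torchvision import models

# Load a pre-trained network and replace its last layer for the new task.
net = models.alexnet(pretrained=True)
net.classifier[6] = torch.nn.Linear(4096, 20)  # new, randomly initialized layer

# Parameter groups give each set of layers its own learning rate:
# low for the pre-trained layers, high for the new layer.
optimizer = torch.optim.SGD([
    {'params': net.features.parameters(), 'lr': 1e-4},
    {'params': net.classifier[:6].parameters(), 'lr': 1e-4},
    {'params': net.classifier[6].parameters(), 'lr': 1e-2},
], momentum=0.9)
```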
Previously, off-the-shelf features.. a representation learned end-to-end for Task A (e.g. image classification) is reused as off-the-shelf features for Task B (e.g. image retrieval). Part II: Off-the-shelf features. Slide credit: X. Giro
Previously, off-the-shelf features.. Image classification: image as input, label as output (e.g. "Orange"). Local features can be aggregated into spatially coded image representations (like spatial pyramids) or orderless image representations (like BoW).
Two deep lectures in M5: Deep ConvNets for Recognition at... Global Scale (previous lecture) and Local Scale (today's lecture).
Image Classification: image as input, label as output (e.g. "Orange"). How to process non-square images? resize / zero padding / largest centred square (see the sketch below).
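A small PIL sketch of the three options; the function name, the 227×227 target size, and the centring choices are hypothetical:

```python
from PIL import Image

def to_square(img, size=227, mode='resize'):
    if mode == 'resize':                      # distorts the aspect ratio
        return img.resize((size, size))
    if mode == 'pad':                         # zero padding on a square canvas
        side = max(img.size)
        canvas = Image.new('RGB', (side, side))  # black = zero padding
        canvas.paste(img, ((side - img.width) // 2, (side - img.height) // 2))
        return canvas.resize((size, size))
    if mode == 'crop':                        # largest centred square
        s = min(img.size)
        left, top = (img.width - s) // 2, (img.height - s) // 2
        return img.crop((left, top, left + s, top + s)).resize((size, size))
    raise ValueError(mode)
```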
Local object recognition object localization (single object) object detection semantic segmentation
Classification + Localization. Slide credit: Li, Karpathy, Johnson
Localization as regression: in addition to the classification head (class scores), attach a regression head that outputs the four box coordinates (x, y, w, h). Slide credit: Li, Karpathy, Johnson
Localization as regression. Problem: multiple classes. Classification head: C class scores; regression head: C×4 numbers (one box per class). See the sketch below.
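A minimal sketch of such a two-headed network; the toy trunk, layer sizes and C = 20 classes are hypothetical:

```python
import torch
import torch.nn as nn

C = 20  # number of classes (hypothetical)
trunk = nn.Sequential(nn.Conv2d(3, 64, 7, stride=2), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten())
cls_head = nn.Linear(64, C)        # classification head: C class scores
reg_head = nn.Linear(64, C * 4)    # regression head: C x 4 box numbers

x = torch.randn(1, 3, 224, 224)
feats = trunk(x)
scores, boxes = cls_head(feats), reg_head(feats).view(1, C, 4)
# Train with cross-entropy on `scores` and an L2 / smooth-L1 loss on the
# box row belonging to the ground-truth class.
```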
Localization as regression (example): localization of clothes. Regression is done in two steps: first the person bounding box, then the clothing bounding boxes. (Master project 2015) Esteve Cervantes: Evaluating deep features for Fashion Recognition
Local object recognition: object localization (single object), object detection (any ideas?), semantic segmentation
Sliding window: slide a 227×227 window over the image and run classification + regression at every position, yielding a regressed bounding box and a classification score (e.g. 0.03 vs. 0.83) for each window.
Sliding window: repeat for different scales and combine all results (e.g. scores 0.83, 0.99) with non-maximum suppression (a NumPy sketch follows below).
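A plain NumPy sketch of greedy non-maximum suppression; the 0.5 IoU threshold is a common but arbitrary choice:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    # boxes: (N, 4) as [x0, y0, x1, y1]; scores: (N,)
    x0, y0, x1, y1 = boxes.T
    areas = (x1 - x0) * (y1 - y0)
    order = scores.argsort()[::-1]     # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of the best box with the remaining ones
        xx0 = np.maximum(x0[i], x0[order[1:]])
        yy0 = np.maximum(y0[i], y0[order[1:]])
        xx1 = np.minimum(x1[i], x1[order[1:]])
        yy1 = np.minimum(y1[i], y1[order[1:]])
        inter = np.maximum(0, xx1 - xx0) * np.maximum(0, yy1 - yy0)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]   # drop heavily overlapping boxes
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], float)
print(nms(boxes, np.array([0.9, 0.8, 0.7])))  # keeps boxes 0 and 2
```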
Sliding window (efficient computation). For simplicity, consider a simple three-layer network: a 10×10×3 input, conv1 with 5 filters of 5×5×3 (output 6×6×5), fc1 with 10 units, and fc2 with 2 outputs (car/not car).
What is the spatial size of the conv1 output? When the window slides over a larger image, part of the convolutional features are the same and do not need recomputation!
How many 10×10 windows are there in a 12×17 image? 3×8 = 24.
The convolutions can be computed in a single pass: conv1 applied to the full 12×17 image gives an 8×13×5 feature map.
The fully connected layers can be rewritten as convolutions: fc1 = conv2 with 10 filters of 6×6×5 (output 3×8×10), and fc2 = conv3 with 2 filters of 1×1×10 (output 3×8×2).
We thus obtain all 3×8 = 24 classification scores while sharing the computation of the convolutional features.
Sliding window (efficient computation): networks can be rewritten as fully convolutional networks to speed up computation at test time (see the sketch below). Example of bear and fish detection at multiple scales. Sermanet et al., OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks, ICLR 2014
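A hedged PyTorch sketch of the toy network above written fully convolutionally; the layer sizes follow the slides, everything else (framework, random inputs) is illustrative:

```python
import torch
import torch.nn as nn

# A "classifier" trained on 10x10x3 crops, written entirely with convolutions
# so it can be applied to larger images in a single pass.
fully_conv_net = nn.Sequential(
    nn.Conv2d(3, 5, kernel_size=5),    # conv1: 5 filters of 5x5x3 -> 6x6x5 on a 10x10 crop
    nn.ReLU(),
    nn.Conv2d(5, 10, kernel_size=6),   # fc1 rewritten as conv2: 10 filters of 6x6x5 -> 1x1x10
    nn.ReLU(),
    nn.Conv2d(10, 2, kernel_size=1),   # fc2 rewritten as conv3: 2 filters of 1x1x10 -> 1x1x2
)

crop = torch.randn(1, 3, 10, 10)
image = torch.randn(1, 3, 12, 17)
print(fully_conv_net(crop).shape)   # torch.Size([1, 2, 1, 1]): one car/not-car score
print(fully_conv_net(image).shape)  # torch.Size([1, 2, 3, 8]): all 24 scores in one pass
```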
object proposals: object proposal methods compute boxes which potentially contain an object. Features are extracted for each box and a classifier is applied. Typically thousands of boxes (but far fewer than sliding window). Many different approaches: selective search, EdgeBoxes, GOP, etc. K. van de Sande et al. Segmentation as selective search for object recognition. ICCV 2011.
object proposals (R-CNN): 1. compute object proposals (~2k); 2. warp the dilated bounding box; 3. compute CNN features; 4. classify regions (car: yes, person: no) and regress the bounding box. Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." CVPR 2014.
object proposals (R-CNN): remove the last layer of AlexNet and fine-tune for the 20 PASCAL classes. Use the fc7 4096-d vector as the description of the bounding box; train an SVM on this representation for classification. Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." CVPR 2014.
object proposals (R-CNN). Slide credit: Girshick; Li, Karpathy, Johnson
object proposals (R-CNN) drawbacks: not end-to-end; warping of boxes; lots of duplicated computation (overlapping bounding boxes). Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." CVPR 2014.
object proposals (Fast R-CNN)
object proposals (Fast R-CNN): shared computation (conv1-conv5). Compute the convolutional features once per image, then extract features from conv5 for all bounding boxes. This was first proposed by: He, Kaiming, et al. "Spatial pyramid pooling in deep convolutional networks for visual recognition." PAMI 2015
object proposals (Fast R-CNN): for all bounding boxes, Region of Interest pooling (RoI pooling) pools the features in a fixed spatial grid.
object proposals (Fast R-CNN): RoI pooling pools the features in a spatial grid and feeds FCs trained with two losses: classification (log loss) and regression (smooth L1 loss). This allows end-to-end training (a RoI-pooling sketch follows below).
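A rough sketch of RoI pooling via adaptive max pooling; the conv5 map size, box coordinates and 7×7 output grid are assumptions:

```python
import torch
import torch.nn.functional as F

def roi_pool(feature_map, box, output_size=(7, 7)):
    # feature_map: (C, H, W) conv5 features; box: (x0, y0, x1, y1) in feature-map coords.
    x0, y0, x1, y1 = box
    region = feature_map[:, y0:y1, x0:x1]
    # Max-pool the region into a fixed spatial grid, whatever its original size.
    return F.adaptive_max_pool2d(region.unsqueeze(0), output_size).squeeze(0)

conv5 = torch.randn(512, 38, 50)           # hypothetical conv5 map for one image
pooled = roi_pool(conv5, (10, 5, 30, 25))  # (512, 7, 7), ready for the FC layers
```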
object proposals (Fast R-CNN): multi-task training also improves classification performance; end-to-end training improves results.

                  Fast R-CNN   R-CNN
Train time (h)    9.5          84      (8.8x speedup)
Test time/image   0.32 s       47 s    (146x speedup)
mAP               66.9%        66.0%

Test time does not include object proposal computation (which is now the bottleneck).
object proposals (Faster R-CNN): shared computation. A Region Proposal Network (RPN) on top of conv5 computes the object proposals directly in the network, followed by RoI pooling and FCs.
object proposals (Faster R-CNN): slide a window over the feature map and add a small network which classifies and regresses bounding boxes; the classification score gives the confidence that an object is present. Use N anchors of varying scales and aspect ratios as proposals at each position (see the sketch below). Slide credit: Kaiming He
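A NumPy sketch of anchor generation in the spirit of the RPN; the stride-16 feature map and the 3 scales × 3 ratios follow the Faster R-CNN defaults, but the function itself is illustrative:

```python
import numpy as np

def make_anchors(h, w, stride=16, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    # At each feature-map cell, place boxes of several scales and aspect
    # ratios centred on the corresponding image location.
    anchors = []
    for y in range(h):
        for x in range(w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for s in scales:
                for r in ratios:
                    bw, bh = s * np.sqrt(r), s / np.sqrt(r)
                    anchors.append([cx - bw / 2, cy - bh / 2,
                                    cx + bw / 2, cy + bh / 2])
    return np.array(anchors)

print(make_anchors(2, 3).shape)  # (2*3*9, 4) = (54, 4): 9 anchors per position
```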
object proposals (Faster R-CNN): computation for 1000 boxes.

Model                     Time
EdgeBoxes + R-CNN         0.25 s + 1000 × ConvTime + 1000 × FcTime
EdgeBoxes + Fast R-CNN    0.25 s + 1 × ConvTime + 1000 × FcTime
Faster R-CNN              1 × ConvTime + 1000 × FcTime

Slide credit: Kaiming He
object proposals (Faster R-CNN). Slide credit: Li, Karpathy, Johnson
object localization: winner of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2015 with residual networks and Faster R-CNN.
summary object detection:
- Object localization: when there is one (or a known number of) objects/classes, you can do object localization by adding a regression head to your network.
- Sliding window + CNN can be computed efficiently by rewriting the network as a fully convolutional network.
- Object proposal methods combine straightforwardly with CNNs, but for fast/good results consider: adding a regression head to improve the bounding box estimate; sharing the computation of the convolutional features (SPP); end-to-end training of the network (Fast R-CNN); including a Region Proposal Network for fast object proposals within the network (Faster R-CNN).
Slide credit: Li, Karpathy, Johnson
Local object recognition object localization (single object) object detection semantic segmentation
semantic segmentation: assign a class to every pixel. Instance segmentation: assign pixels to a particular instance of a class (chair 1, etc.).
semantic segmentation: a ConvNet predicts the class of the center pixel. Write the network as a fully convolutional network and apply it to the whole image. Because of the convolutions and pooling the output resolution is smaller, so upsampling is required.
semantic segmentation: pixelwise loss. Long et al., Fully Convolutional Networks for Semantic Segmentation, CVPR 2015
semantic segmentation: a convolution (3×3) with padding [1 1 1 1] and stride [1 1] preserves the input resolution; with stride [2 2] the output resolution is halved. Long et al., Fully Convolutional Networks for Semantic Segmentation, CVPR 2015
semantic segmentation: a deconvolution (3×3) with padding [1 1 1 1] and stride [2 2] doubles the resolution. Deconvolutions are also called fractionally strided convolutions or transposed convolutions (see the sketch below).
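A short PyTorch sketch of 2× upsampling with a transposed convolution; the 21-channel score map is a hypothetical PASCAL-style example:

```python
import torch
import torch.nn as nn

# Transposed convolution: stride 2 "fractionally strides" the input,
# doubling the spatial resolution of the coarse score map.
upsample = nn.ConvTranspose2d(in_channels=21, out_channels=21,
                              kernel_size=3, stride=2,
                              padding=1, output_padding=1)
coarse_scores = torch.randn(1, 21, 16, 16)  # e.g. 21 PASCAL classes
print(upsample(coarse_scores).shape)        # torch.Size([1, 21, 32, 32])
```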
semantic segmentation. Noh et al., Learning Deconvolution Network for Semantic Segmentation, ICCV 2015
semantic segmentation: combine where (local, shallow) with what (global, deep). Long et al., Fully Convolutional Networks for Semantic Segmentation, CVPR 2015
semantic segmentation: skip layers interpolate and sum predictions from several depths to produce a dense output (a sketch follows below). Long et al., Fully Convolutional Networks for Semantic Segmentation, CVPR 2015
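A sketch of one such skip connection, fusing coarse and fine score maps; all shapes are hypothetical:

```python
import torch
import torch.nn.functional as F

# FCN-style skip: upsample coarse scores and sum them with scores
# predicted from a shallower, higher-resolution layer.
coarse = torch.randn(1, 21, 8, 8)    # stride-32 predictions
fine = torch.randn(1, 21, 16, 16)    # stride-16 predictions
fused = fine + F.interpolate(coarse, scale_factor=2,
                             mode='bilinear', align_corners=False)
print(fused.shape)  # torch.Size([1, 21, 16, 16])
```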
semantic segmentation results: input image, ground truth, and predictions at stride 32 (no skips), stride 16 (1 skip) and stride 8 (2 skips); more skips give finer segmentations. Long et al., Fully Convolutional Networks for Semantic Segmentation, CVPR 2015
semantic segmentation Eigen, Fergus, Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-Scale Convolutional Architecture, ICCV 2015
semantic segmentation Surface normals results Eigen, Fergus, Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-Scale Convolutional Architecture, ICCV 2015
instance segmentation. Dai et al., Instance-aware Semantic Segmentation via Multi-task Network Cascades, arXiv 2015.
instance segmentation: results vs. ground truth. Dai et al., Instance-aware Semantic Segmentation via Multi-task Network Cascades, arXiv 2015.
Generative Adversarial Networks: fractionally strided convolutions (deconvolutions) can be used to generate images from noise.
Generative Adversarial Networks. Suppose I would like to generate images of horses: generated horse images G(z) are produced from noise z. I can train a discriminative network D to distinguish real horse images x from generated horse images G(z): $\max_D \; \log D(x) + \log(1 - D(G(z)))$
Generative Adversarial Networks: I can then optimize my generative network G to fool the discriminative network: $\min_G \max_D \; \log D(x) + \log(1 - D(G(z)))$
Generative Adversarial Networks: you can then re-optimize the discriminator network D, etc., alternating $\min_G \max_D \; \log D(x) + \log(1 - D(G(z)))$ until D gives in (a minimal training-loop sketch follows below). Goodfellow et al., Generative Adversarial Nets, NIPS 2014
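A minimal, hedged sketch of one alternating GAN update in PyTorch; the networks and shapes are toy assumptions, and the generator step uses the common non-saturating variant (maximize log D(G(z)) instead of minimizing log(1 − D(G(z)))):

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(100, 784), nn.Tanh())    # noise z -> fake image (toy)
D = nn.Sequential(nn.Linear(784, 1), nn.Sigmoid())   # image -> P(real) (toy)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
bce = nn.BCELoss()

x = torch.randn(64, 784)   # stand-in for a batch of real images
z = torch.randn(64, 100)   # noise batch

# Discriminator step: maximize log D(x) + log(1 - D(G(z)))
loss_D = bce(D(x), torch.ones(64, 1)) + bce(D(G(z).detach()), torch.zeros(64, 1))
opt_D.zero_grad(); loss_D.backward(); opt_D.step()

# Generator step: fool D (non-saturating trick: maximize log D(G(z)))
loss_G = bce(D(G(z)), torch.ones(64, 1))
opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```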
Generative Adversarial Networks: examples of generated bedrooms. Radford et al., Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks, ICLR 2016
Generative Adversarial Networks: interpolation between points in z. Radford et al., Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks, ICLR 2016
summary semantic segmentation:
- Fully convolutional networks can be applied for efficient classification of all pixels.
- To get high-quality segmentations, deep features from multiple scales need to be combined (e.g. with skip layers).
- Upsampling can be done with deconvolution and unpooling operations.
- Instance segmentation can be performed by combining the object detection and semantic segmentation pipelines.
Slide credit: Li, Karpathy, Johnson