Module 5. Deep Convnets for Local Recognition Joost van de Weijer 4 April 2016


1 Module 5 Deep Convnets for Local Recognition Joost van de Weijer 4 April 2016

2 Previously, end-to-end... Dog Slide credit: Jose M

3 Previously, end-to-end... Dog Learned Representation Slide credit: Jose M

4 Previously, end-to-end... Dog Learned Representation Part I: End-to-end learning (E2E)

5 Previously, end-to-end... Learned Representation Task A (e.g. image classification) Part I: End-to-end learning (E2E)

6 Previously, fine-tuning... Part I: End-to-end learning (E2E): Learned Representation, Domain A. Transfer. Part I': End-to-End Fine-Tuning (FT): Fine-tuned Learned Representation, Domain B. Slide credit: X. Giro

7 Previously, fine-tuning... Fine-tuning a pre-trained network Slide credit: Victor Campos, Layer-wise CNN surgery for Visual Sentiment Prediction (ETSETB 2015)

8 Previously, fine-tuning... Fine-tuning a pre-trained network: use a high learning rate in the new layer and a low learning rate in all other layers. Slide credit: Victor Campos, Layer-wise CNN surgery for Visual Sentiment Prediction (ETSETB 2015)
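
The learning-rate rule on this slide can be sketched as follows. This is a minimal illustration assuming plain SGD; the layer names, rates, and gradient values are hypothetical and not taken from the lecture.

```python
# Per-layer learning rates for fine-tuning: the freshly initialised
# layer gets a high rate, the pre-trained layers get a low one.
# All names and numbers below are illustrative assumptions.

def sgd_step(params, grads, lr_per_layer):
    """Return updated parameters after one SGD step with per-layer rates."""
    return {
        name: [w - lr_per_layer[name] * g
               for w, g in zip(weights, grads[name])]
        for name, weights in params.items()
    }

params = {"conv1": [1.0, 1.0], "fc_new": [1.0, 1.0]}
grads = {"conv1": [0.5, -0.5], "fc_new": [0.5, -0.5]}
lr_per_layer = {"conv1": 0.001, "fc_new": 0.01}  # low vs. high rate

updated = sgd_step(params, grads, lr_per_layer)
```

With identical gradients, the new layer moves ten times further than the pre-trained layer, which is the intended effect: the pre-trained features are only gently adapted.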

9 Previously, off-the-shelf features... Learned Representation Task A (e.g. image classification) Part I: End-to-end learning (E2E) Part II: Off-the-shelf features Task B (e.g. image retrieval) Slide credit: X. Giro

10 Previously, off-the-shelf features... Image classification: image as an input, label as output (e.g. Orange). Spatially coded image representations (like spatial pyramids); orderless image representations (like BOW).

11 Two deep lectures in M5: Deep ConvNets for Recognition at... Global Scale (today's lecture), Local Scale (next lecture)

12 Image Classification Image classification: image as an input, label as output (e.g. Orange). How to process non-square images? Resize, zero padding, or largest centred square.
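
Two of the three options for non-square images can be sketched with simple box arithmetic. The function names and the (left, top, right, bottom) convention are assumptions made for this illustration.

```python
def largest_centred_square(width, height):
    """Crop box (left, top, right, bottom) of the largest centred square."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return (left, top, left + side, top + side)


def zero_padding(width, height):
    """Padding (left, top, right, bottom) that makes the image square."""
    side = max(width, height)
    pad_w, pad_h = side - width, side - height
    left, top = pad_w // 2, pad_h // 2
    return (left, top, pad_w - left, pad_h - top)


crop_box = largest_centred_square(640, 480)
padding = zero_padding(640, 480)
```

The third option, a plain resize to a square, needs no box computation but changes the aspect ratio of the objects. Cropping keeps the aspect ratio and discards the image borders; padding keeps everything and adds empty pixels.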

13 Local object recognition object localization (single object) object detection semantic segmentation

14 Classification+LOCALIZATION slide credit: Li, Karpathy, Johnson

15 Localization as regression slide credit: Li, Karpathy, Johnson

17 Localization as regression: classification head and regression head. Slide credit: Li, Karpathy, Johnson

18 Localization as regression: classification head and regression head. Slide credit: Li, Karpathy, Johnson

20 Localization as regression Problem: multiple classes. Classification head: C class scores. Regression head: C x 4 numbers (one box of 4 numbers per class). Slide credit: Li, Karpathy, Johnson
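
A minimal sketch of how the two heads can be combined at test time: the classification head selects the class, and the box regressed for that class is returned. The flat (x, y, w, h) layout per class is an assumption for illustration only.

```python
def predict_box(class_scores, box_outputs):
    """Pick the box regressed for the highest-scoring class.

    class_scores: list of C scores from the classification head.
    box_outputs: flat list of C*4 numbers from the regression head,
    stored as (x, y, w, h) per class (this layout is an assumption).
    """
    num_classes = len(class_scores)
    assert len(box_outputs) == 4 * num_classes
    best = max(range(num_classes), key=lambda c: class_scores[c])
    return best, box_outputs[4 * best: 4 * best + 4]


scores = [0.1, 0.7, 0.2]                 # C = 3 classes
boxes = [float(v) for v in range(12)]    # C x 4 = 12 numbers
best_class, best_box = predict_box(scores, boxes)
```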

22 Localization as regression (example) Example of localization of clothes. Regression is done in two steps: first the person bounding box and then the clothes bounding boxes. Esteve Cervantes: Evaluating deep features for Fashion Recognition (master project 2015)

23 Local object recognition object localization (single object) object detection any ideas? semantic segmentation

24 Sliding window: classification + regression (figure labels: 227, 0.83). Compute a new regressed bounding box and classification score for all sliding window positions.

25 Sliding window Repeat for different scales and combine all results (e.g. with non-maximum suppression).
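
Non-maximum suppression can be sketched as a greedy procedure: keep the highest-scoring box and discard every remaining box that overlaps a kept box too much. The (x1, y1, x2, y2) box format and the 0.5 overlap threshold are assumptions for this illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def nms(boxes, scores, threshold=0.5):
    """Greedy NMS: return indices of kept boxes, best score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it does not overlap any already kept box.
        if all(iou(boxes[i], boxes[j]) <= threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

Here the second box overlaps the first with an IoU of 81/119 and is suppressed, while the third box is far away and survives.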

26 Sliding window (efficient computation) Let us for simplicity consider a simple three-layer network (conv1 with 5x5 filters, fc1, fc2) that classifies a 10x10 window as car/not car, applied to a 12x17 image. What are the spatial dimensions of conv1? Part of the convolutional features is the same for overlapping windows and does not need recomputation!

27 Sliding window (efficient computation) Same three-layer network (conv1 with 5x5 filters, fc1, fc2; car/not car). How many 10x10 windows are there in this 12x17 image?

28 Sliding window (efficient computation) Same three-layer network. The convolutions of conv1 can be computed in a single pass over the 12x17 image.

29 Sliding window (efficient computation) Same three-layer network. Figure labels: 6x6x5 (conv1 output for one 10x10 window), 1x1x10, 5x5, conv1, fc2.

30 Sliding window (efficient computation) Same three-layer network. conv1: filters of 5x5x3; fc2 = conv2: filters of 6x6x5.

31 Sliding window (efficient computation) Same three-layer network. conv1: filters of 5x5x3; fc2 = conv2: filters of 6x6x5; fc3: output 1x1x2.

32 Sliding window (efficient computation) Same three-layer network. We have the 8x3=24 classification scores, sharing the computation of the convolutional features. conv1: 5 filters of 5x5x3; fc2 = conv2: 10 filters of 6x6x5; fc3 = conv3: 2 filters of 1x1x10.
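
The sizes in this example can be checked with the output-size formula for a convolution. The sketch below assumes stride 1 and no padding, which is consistent with the 8x3=24 scores stated on the slide; the helper name is hypothetical.

```python
def valid_output_size(input_size, filter_size, stride=1):
    """Output size along one axis of a convolution without padding."""
    return (input_size - filter_size) // stride + 1


# A single 10x10 window through conv1 (5x5 filters) gives a 6x6 map,
# which is why the first fully connected layer equals a 6x6 convolution.
window_conv1 = valid_output_size(10, 5)

# The whole 12x17 image in a single fully convolutional pass.
conv1 = (valid_output_size(12, 5), valid_output_size(17, 5))
conv2 = (valid_output_size(conv1[0], 6), valid_output_size(conv1[1], 6))
conv3 = (valid_output_size(conv2[0], 1), valid_output_size(conv2[1], 1))

# Number of distinct 10x10 windows in a 12x17 image.
num_windows = valid_output_size(12, 10) * valid_output_size(17, 10)
```

The final map has one position per sliding window, so the fully convolutional pass produces all window scores at once.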

33 Sliding window (efficient computation) Networks can be written as fully convolutional networks to speed up computation at test time. Example of bear and fish detection on multiple scales. Sermanet et al., Integrated Recognition, Localization and Detection using Convolutional Networks, ICLR 2014

34 object proposals Object proposal methods compute boxes which potentially contain an object. Features for each box are extracted and a classifier is applied. Typically thousands of boxes (but far fewer than with a sliding window). Many different approaches: selective search, edge boxes, GOP, etc. Figure: selective search. K. Van de Sande et al., Segmentation as selective search for object recognition, ICCV 2011.

35 object proposals (RCNN) 1. compute object proposals (~2k); 2. warp dilated bounding box; 3. compute CNN features; 4. classify regions (car: yes, person: no), with bounding box regression. Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." CVPR 2014.

36 object proposals (RCNN) Alex Net Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." CVPR 2014.

37 object proposals (RCNN) Alex Net: remove the last layer and fine-tune for the 20 PASCAL classes. Use the fc-layer feature vector as the description of the bounding box. Train an SVM on this representation for classification. Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." CVPR 2014.

38 object proposals (RCNN) slide credit: Girshick

39 object proposals (RCNN)

40 object proposals (RCNN) slide credit: Li, Karpathy, Johnson

41 object proposals (RCNN) Drawbacks: not end-to-end; warping of boxes; lots of double computation (overlap of bounding boxes). Pipeline: 1. compute object proposals (~2k); 2. warp dilated bounding box; 3. compute CNN features; 4. classify regions (car: yes, person: no), with improved bounding box. Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." CVPR 2014.

42 object proposals (Fast R-CNN)

43 object proposals (Fast R-CNN) Shared computation (conv1-conv5): compute the convolutional features once per image. He, Kaiming, et al. "Spatial pyramid pooling in deep convolutional networks for visual recognition." PAMI 2015

44 object proposals (Fast R-CNN) Shared computation: compute the convolutional features once, then extract features from conv5 for all bounding boxes. This was first proposed by: He, Kaiming, et al. "Spatial pyramid pooling in deep convolutional networks for visual recognition." PAMI 2015

45 object proposals (Fast R-CNN) Shared computation. For all bounding boxes: Region of Interest pooling (ROI pooling), which pools the features in a spatial grid.
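
ROI pooling can be sketched on a single-channel feature map: the region is split into a fixed grid and each cell is max-pooled, so every box yields an output of the same size. The bin-rounding rule, the (top, left, bottom, right) convention, and the function names are assumptions for this illustration, not the exact Fast R-CNN implementation.

```python
def bin_edges(start, length, grid):
    """Split [start, start + length) into `grid` bins of at least one cell."""
    edges = []
    for g in range(grid):
        lo = start + (g * length) // grid
        hi = start + max(((g + 1) * length) // grid, (g * length) // grid + 1)
        edges.append((lo, hi))
    return edges


def roi_max_pool(feature_map, roi, grid=2):
    """Max-pool roi=(top, left, bottom, right) into a grid x grid output.

    feature_map is a list of rows; bottom and right are exclusive.
    """
    top, left, bottom, right = roi
    rows = bin_edges(top, bottom - top, grid)
    cols = bin_edges(left, right - left, grid)
    return [[max(feature_map[y][x]
                 for y in range(y0, y1) for x in range(x0, x1))
             for (x0, x1) in cols]
            for (y0, y1) in rows]


feature_map = [[0, 1, 2, 3],
               [4, 5, 6, 7],
               [8, 9, 10, 11],
               [12, 13, 14, 15]]
pooled_full = roi_max_pool(feature_map, (0, 0, 4, 4))
pooled_wide = roi_max_pool(feature_map, (0, 0, 2, 4))
```

Both regions produce a 2x2 output although their shapes differ, which is what allows fully connected layers to follow.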

46 object proposals (Fast R-CNN) Shared computation; ROI pooling (pool the features in a spatial grid); FCs; classification: log loss; regression: smooth L1 loss; end-to-end training.
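
The two losses named on this slide can be sketched as follows. The smooth L1 form (quadratic below 1, linear above) is the one used in Fast R-CNN; the way the two terms are summed here, with a hypothetical `weight` parameter and no special handling of background boxes, is a simplifying assumption.

```python
import math


def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear for |x| >= 1."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def log_loss(probabilities, true_class):
    """Negative log-probability assigned to the true class."""
    return -math.log(probabilities[true_class])


def multitask_loss(probabilities, true_class, predicted_box, target_box,
                   weight=1.0):
    """Classification log loss plus weighted smooth L1 box loss."""
    box_loss = sum(smooth_l1(p - t)
                   for p, t in zip(predicted_box, target_box))
    return log_loss(probabilities, true_class) + weight * box_loss


loss = multitask_loss([0.5, 0.5], 0, [0.5, 0.0, 2.0, 0.0],
                      [0.0, 0.0, 0.0, 0.0])
```

Because the loss grows only linearly for large errors, outlier boxes produce bounded gradients, unlike an L2 loss.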

47 object proposals (Fast R-CNN) Multi-task training also improves classification performance. End-to-end training improves results.

                      Fast R-CNN   R-CNN
Train time speedup    8.8x         -
Test time/image       0.32s        47s
Test speedup          146x         -
mAP                   66.9%        66.0%

Test time does not include object proposal computation (which is now the bottleneck).

48 object proposals (Faster R-CNN) Shared computation (conv5), Region Proposal Network (RPN), ROI pooling, FCs: compute the object proposals directly in the network.

49 object proposals (Faster R-CNN) Slide a window over the feature map. Add a network which classifies and regresses the bounding boxes. The classification score provides the confidence that an object is present. Slide credit: Kaiming He

50 object proposals (Faster R-CNN) Slide a window over the feature map. Add a network which classifies and regresses the bounding boxes. The classification score provides the confidence that an object is present. Use N anchors for proposals of varying aspect ratios. Slide credit: Kaiming He
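
Anchors of varying aspect ratio can be sketched as boxes that share a centre and an area but differ in their width-to-height ratio. The area value, the ratios, and the function name are assumptions for this illustration.

```python
import math


def make_anchors(cx, cy, area, aspect_ratios):
    """Boxes (x1, y1, x2, y2) centred at (cx, cy), each with the given
    area and a width/height ratio taken from aspect_ratios."""
    anchors = []
    for ratio in aspect_ratios:
        w = math.sqrt(area * ratio)
        h = w / ratio
        anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


# N = 3 anchors at one sliding-window position (values are illustrative).
anchors = make_anchors(0.0, 0.0, 16.0, [1.0, 4.0, 0.25])
```

At every position of the sliding window the network then predicts, for each anchor, an objectness score and a box refinement relative to that anchor.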

51 object proposals (Faster R-CNN) Computation for 1000 boxes:

Model                       Time
Edge boxes + R-CNN          0.25 sec *ConvTime *FcTime
Edge boxes + fast R-CNN     0.25 sec + 1*ConvTime *FcTime
faster R-CNN                1*ConvTime *FcTime

Slide credit: Kaiming He

52 object proposals (Faster R-CNN) slide credit: Li, Karpathy, Johnson

53 object proposals (Faster R-CNN) slide credit: Li, Karpathy, Johnson

54 object localization Winner of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC): residual networks with Faster RCNN

55 object localization Winner of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2015: residual networks with Faster RCNN

56 summary object detection Object localization: when there is one or a known number of objects/classes, you can do object localization by adding a regression head to your network. Sliding window + CNN can be computed efficiently by writing the network as a fully convolutional network. Object proposal methods are straightforwardly combined with CNNs, but for fast/good results consider: (1) adding a regression head to improve bounding box estimation; (2) sharing computation of the convolutional features (SPP); (3) end-to-end training of the network (Fast R-CNN); (4) including a Region Proposal Network for fast object proposals within the network (Faster R-CNN). Slide credit: Li, Karpathy, Johnson

57 Local object recognition object localization (single object) object detection semantic segmentation

58 semantic segmentation Semantic segmentation: assign a class to every pixel. Instance segmentation: assign pixels to a particular instance of a class (chair1, etc.).

59 semantic segmentation ConvNet: predict the class of the center pixel. Write the network as a fully convolutional network and apply it to the image. Because of the convolutions the output resolution is smaller, so upsampling is required.

60 semantic segmentation pixelwise loss Long et al., Fully Convolutional Networks for Semantic Segmentation, CVPR 2015

61 input semantic segmentation Convolution (3x3) padding [ ] stride [1 1] Long et al., Fully Convolutional Networks for Semantic Segmentation, CVPR 2015

62 input semantic segmentation Convolution (3x3) padding [ ] stride [1 1]

63 input input semantic segmentation Convolution (3x3) padding [ ] stride [1 1] Convolution (3x3) padding [ ] stride [2 2]

64 input input semantic segmentation Convolution (3x3) padding [ ] stride [1 1] Convolution (3x3) padding [ ] stride [2 2]

65 input semantic segmentation deconvolution (3x3) padding [ ] stride [2 2]

66 input semantic segmentation deconvolution (3x3) padding [ ] stride [2 2] Deconvolutions are also called fractionally strided convolutions or transposed convolutions.
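
A one-dimensional sketch may help to see why a deconvolution with stride 2 upsamples: every input value scatters a scaled copy of the kernel into the output, and consecutive copies are placed `stride` positions apart. The 1-D setting, the absence of padding, and the kernel values are assumptions for this illustration.

```python
def conv_transpose_1d(signal, kernel, stride=2):
    """1-D transposed convolution without padding.

    Each input value adds value * kernel to the output, starting at
    position index * stride, so overlapping contributions are summed.
    """
    out_len = (len(signal) - 1) * stride + len(kernel)
    out = [0.0] * out_len
    for i, value in enumerate(signal):
        for k, weight in enumerate(kernel):
            out[i * stride + k] += value * weight
    return out


upsampled = conv_transpose_1d([1.0, 2.0], [1.0, 1.0, 1.0], stride=2)
```

Two input values become five output values; the middle output receives contributions from both inputs because the 3-tap kernels overlap by one position.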

67 semantic segmentation Noh et al. ICCV 2015

68 semantic segmentation Noh et al. ICCV 2015

69 semantic segmentation Combine where (local, shallow) with what (global, deep). Long et al., Fully Convolutional Networks for Semantic Segmentation, CVPR 2015

70 semantic segmentation Skip layers: interp + sum, interp + sum, dense output. Long et al., Fully Convolutional Networks for Semantic Segmentation, CVPR 2015
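
The "interp + sum" step of a skip layer can be sketched on small score maps: the coarse (deep) map is upsampled to the resolution of the finer (shallower) map and the two are added. Nearest-neighbour interpolation is used here only to keep the sketch short; it is a stand-in for the interpolation used in the paper, and the function names are hypothetical.

```python
def upsample_nearest(score_map, factor=2):
    """Nearest-neighbour upsampling of a 2-D score map (list of rows)."""
    return [[value for value in row for _ in range(factor)]
            for row in score_map for _ in range(factor)]


def fuse(coarse, fine, factor=2):
    """Interp + sum: upsample the coarse map and add the fine map."""
    up = upsample_nearest(coarse, factor)
    return [[u + f for u, f in zip(up_row, fine_row)]
            for up_row, fine_row in zip(up, fine)]


coarse_scores = [[1.0]]                      # deep layer: "what"
fine_scores = [[1.0, 2.0], [3.0, 4.0]]       # shallow layer: "where"
fused = fuse(coarse_scores, fine_scores)
```

Repeating the step with an even shallower layer gives the second skip, and a final upsampling produces the dense output.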

71 semantic segmentation Figure columns: input image; stride 32 (no skips); stride 16 (1 skip); stride 8 (2 skips); ground truth. Long et al., Fully Convolutional Networks for Semantic Segmentation, CVPR 2015

72 semantic segmentation Eigen, Fergus, Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-Scale Convolutional Architecture, ICCV 2015

73 semantic segmentation Surface normals results Eigen, Fergus, Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-Scale Convolutional Architecture, ICCV 2015

74 instance segmentation Dai et al., Instance-aware Semantic Segmentation via Multi-task Network Cascades, arXiv 2015.

77 instance segmentation Figure: results and ground truth. Dai et al., Instance-aware Semantic Segmentation via Multi-task Network Cascades, arXiv 2015.

78 Generative Adversarial Networks Fractionally strided convolutions (deconvolutions) can be used to generate images from noise. Dai et al., Instance-aware Semantic Segmentation via Multi-task Network Cascades, arXiv 2015.

79 Generative Adversarial Networks Suppose I would like to generate images of horses. My generated horse images G(z) are generated from noise z. (Figure: generated horses G(z), real horses x.) I can train a discriminative network D to distinguish real horse images x from generated horse images G(z): $\max_D \; \log D(x) + \log(1 - D(G(z)))$

80 Generative Adversarial Networks Suppose I would like to generate images of horses. My generated horse images G(z) are generated from noise z. (Figure: generated horses G(z), real horses x.) I can then optimize my generative network G to fool the discriminative network D: $\min_G \max_D \; \log D(x) + \log(1 - D(G(z)))$

81 Generative Adversarial Networks Suppose I would like to generate images of horses. My generated horse images G(z) are generated from noise z. (Figure: generated horses G(z), real horses x.) You can then re-optimize the discriminative network D, etc.: $\min_G \max_D \; \log D(x) + \log(1 - D(G(z)))$

82 Generative Adversarial Networks Suppose I would like to generate images of horses. My generated horse images G(z) are generated from noise z. (Figure: generated horses G(z), real horses x.) You can then re-optimize the discriminative network D, etc., until D gives in: $\min_G \max_D \; \log D(x) + \log(1 - D(G(z)))$ Goodfellow et al., Generative Adversarial Nets, NIPS 2014
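
The objective on these slides can be evaluated numerically for given discriminator outputs. In the sketch below the value log D(x) + log(1 - D(G(z))) is averaged over a small batch, which is an assumption about how the objective is estimated in practice; the function name and the example probabilities are hypothetical.

```python
import math


def gan_value(d_real, d_fake):
    """Average of log D(x) over real samples plus the average of
    log(1 - D(G(z))) over generated samples.

    d_real: discriminator outputs D(x) on real images, in (0, 1).
    d_fake: discriminator outputs D(G(z)) on generated images, in (0, 1).
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term


# D tries to make this value large, G tries to make it small.
confident_d = gan_value([0.9, 0.9], [0.1, 0.1])
fooled_d = gan_value([0.5, 0.5], [0.5, 0.5])
```

When D outputs 0.5 everywhere it can no longer tell real from generated images, and the value equals -2 log 2; this is the sense in which D "gives in".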

83 Generative Adversarial Networks Examples of generated bedrooms. Radford et al., Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks, ICLR 2016

84 Generative Adversarial Networks Interpolation between points in z. Radford et al., Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks, ICLR 2016
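
Interpolation between two points in z can be sketched as below. Linear interpolation is assumed here for simplicity; each intermediate vector would then be passed through the generator G to obtain an image. The function name and the example vectors are hypothetical.

```python
def interpolate(z_start, z_end, steps):
    """Linearly interpolate between two latent vectors (lists of floats).

    Returns `steps` vectors, including both end points (steps >= 2).
    """
    fractions = [i / (steps - 1) for i in range(steps)]
    return [[(1 - t) * a + t * b for a, b in zip(z_start, z_end)]
            for t in fractions]


path = interpolate([0.0, 0.0], [1.0, 2.0], 3)
```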

85 summary semantic segmentation Fully convolutional networks can be applied for efficient classification of all pixels. To get high-quality segmentations, deep features of multiple scales need to be combined (e.g. with skip layers). Upsampling can be done by deconvolution and unpooling operations. Instance segmentation can be performed by combining object detection and semantic segmentation pipelines. Slide credit: Li, Karpathy, Johnson