CSED703R: Deep Learning for Visual Recognition (2016S)
Lecture 6: CNNs for Detection, Tracking, and Segmentation

Bohyung Han
Computer Vision Lab.
bhhan@postech.ac.kr

Object Detection

Region-based CNN (R-CNN)
- Pipeline: input image → extract region proposals → compute CNN features → classification
- Any proposal method (e.g., selective search, EdgeBoxes)
- Any architecture for feature extraction
- Classification by softmax or SVM
- Independent evaluation of each proposal
- Bounding box regression improves detection accuracy: mean average precision (mAP) of 53.7% with bounding box regression on the VOC 2010 test set

[Girshick14] R. Girshick, J. Donahue, T. Darrell, J. Malik: Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, CVPR 2014

Motivation
- The sliding window approach is not feasible for object detection with convolutional neural networks.
- We need a faster method to identify object candidates.

Selective Search
- Finds object proposals by greedy hierarchical superpixel segmentation
- Diversifies superpixel construction and merging:
  - Using a variety of color spaces
  - Using different similarity measures
  - Varying starting regions

[Uijlings13] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, A. W. M. Smeulders: Selective Search for Object Recognition, IJCV 2013
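The greedy merging step above can be sketched in a few lines. This is a toy illustration, not the full algorithm: regions are simplified to dicts holding a bounding box and a color histogram, and all names (`similarity`, `merge`, `selective_search`) are illustrative, assuming histogram intersection as the single similarity measure.

```python
# Minimal sketch of the greedy hierarchical merging behind selective search.
import numpy as np

def similarity(a, b):
    # Histogram intersection: one of the similarity measures used by selective search.
    return np.minimum(a["hist"], b["hist"]).sum()

def merge(a, b):
    # The merged region's box is the union of the two boxes.
    x0 = min(a["box"][0], b["box"][0]); y0 = min(a["box"][1], b["box"][1])
    x1 = max(a["box"][2], b["box"][2]); y1 = max(a["box"][3], b["box"][3])
    return {"box": (x0, y0, x1, y1), "hist": (a["hist"] + b["hist"]) / 2}

def selective_search(regions):
    """Greedily merge the most similar pair until one region remains;
    every intermediate bounding box becomes an object proposal."""
    proposals = [r["box"] for r in regions]
    while len(regions) > 1:
        pairs = [(similarity(regions[i], regions[j]), i, j)
                 for i in range(len(regions)) for j in range(i + 1, len(regions))]
        _, i, j = max(pairs)
        merged = merge(regions[i], regions[j])
        regions = [r for k, r in enumerate(regions) if k not in (i, j)] + [merged]
        proposals.append(merged["box"])
    return proposals
```

In the real method, diversification (multiple color spaces, similarity measures, and starting segmentations) is obtained by running this loop many times with different settings and pooling the proposals.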
Bounding Box Regression
- Learning a transformation of the bounding box
- Region proposal: P = (Px, Py, Pw, Ph); ground truth: G = (Gx, Gy, Gw, Gh)
- Transformation:
  Ĝx = Pw dx(P) + Px
  Ĝy = Ph dy(P) + Py
  Ĝw = Pw exp(dw(P))
  Ĝh = Ph exp(dh(P))
- Each regressor d*(P) is a linear function of the CNN pool5 feature φ5(P), learned by ridge regression:
  w* = argmin_w Σ_i (t_i − wᵀ φ5(P_i))² + λ‖w‖²

Detection Results
- VOC 2010 test set
- Feature analysis on the VOC 2007 test set

Fast R-CNN
- Fast version of R-CNN: 9× faster in training and 213× faster in testing than R-CNN
- A single feature computation and RoI pooling using object proposals
- Bounding box regression inside the network
- Single-stage training using a multi-task loss

[Girshick15] R. Girshick: Fast R-CNN, ICCV 2015

Faster R-CNN
- Fast R-CNN + Region Proposal Network (RPN)
- Moves proposal computation into the network
- Marginal cost of proposals: 10 ms

[Ren15] S. Ren, K. He, R. Girshick, J. Sun: Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, NIPS 2015
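The transformation above can be written out as a small sketch: one function computes the regression targets from a proposal/ground-truth pair, and its inverse applies predicted offsets to refine a box. Function names are illustrative.

```python
# Sketch of R-CNN-style bounding box regression (boxes as center x, center y, w, h).
import numpy as np

def bbox_targets(proposal, gt):
    """Regression targets (tx, ty, tw, th) for proposal P and ground truth G,
    following the parameterization on the slide."""
    Px, Py, Pw, Ph = proposal
    Gx, Gy, Gw, Gh = gt
    return np.array([(Gx - Px) / Pw, (Gy - Py) / Ph,
                     np.log(Gw / Pw), np.log(Gh / Ph)])

def bbox_apply(proposal, d):
    """Apply predicted offsets d = (dx, dy, dw, dh) to recover the refined box:
    the inverse of bbox_targets."""
    Px, Py, Pw, Ph = proposal
    dx, dy, dw, dh = d
    return np.array([Pw * dx + Px, Ph * dy + Py,
                     Pw * np.exp(dw), Ph * np.exp(dh)])
```

The log parameterization of width and height keeps the predicted scale positive and makes the targets roughly scale-invariant, which is why the same regressor works across object sizes.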
Object Detection Performance
- Faster R-CNN with ResNet
- The R-CNN family achieves state-of-the-art performance in object detection!
- Pascal VOC 2007 object detection mAP (%)

Visual Tracking with Convolutional Neural Networks
Visual Tracking

MDNet (Multi-Domain Network)
- Multi-domain learning
- Separating shared and domain-specific layers
- The winner of the Visual Object Tracking (VOT) Challenge 2015

Multi-Domain Learning
- Main idea: training shared features and domain-specific classifiers jointly
- Shared feature representation across domains (Domain 1, Domain 2, Domain 3, Domain 4, ...)
- Domain-specific classifiers, one branch per training sequence
- Pretraining iterates over the domains, updating one domain-specific branch per iteration (iteration #nk+1, #nk+2, ...)
- Transfer to a new domain

[Nam15] Hyeonseob Nam, Bohyung Han: Learning Multi-Domain Convolutional Neural Networks for Visual Tracking, CVPR 2016

Online Tracking using MDNet Features
- Transfer the shared features to a new sequence
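The separation of shared and domain-specific layers can be sketched with plain numpy. This is a toy illustration under assumed shapes (a single 512→128 shared layer and 128→2 binary heads); the real MDNet uses several convolutional and fully connected layers.

```python
# Toy sketch of multi-domain learning: shared features + one binary
# (target vs. background) branch per training sequence.
import numpy as np

rng = np.random.default_rng(0)
shared_W = rng.standard_normal((512, 128)) * 0.01      # shared feature layer
domain_heads = [rng.standard_normal((128, 2)) * 0.01   # one head per domain
                for _ in range(4)]

def forward(x, domain):
    """Score a candidate x through the shared layers and one domain head."""
    h = np.maximum(x @ shared_W, 0.0)                  # shared ReLU features
    scores = h @ domain_heads[domain]                  # domain-specific scores
    e = np.exp(scores - scores.max())
    return e / e.sum()                                 # softmax over {bg, target}

def new_domain_head():
    # At tracking time, the pretraining heads are discarded and a fresh
    # binary head is attached on top of the transferred shared features.
    return rng.standard_normal((128, 2)) * 0.01
```

During pretraining, only the head of the current domain receives gradients in its iteration while the shared layers accumulate sequence-independent representations; transferring to a new video means keeping `shared_W` and calling `new_domain_head()`.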
Online Tracking: Overview
- Transfer the shared features to a new sequence
- For each frame:
  1. Draw target candidates around the previous target state
  2. Find the optimal state: x* = argmax_x f+(x), where f+(x) is the positive score of candidate x
  3. Collect training samples
  4. Update the CNN if needed (fine-tuning)
  5. Repeat for the next frame

Online Network Update
- Long-term update
  - Performed at regular intervals
  - Uses long-term training samples
  - For robustness
- Short-term update
  - Performed at abrupt appearance changes (positive score < 0.5)
  - Uses short-term training samples
  - For adaptiveness

Hard Minibatch Mining
- Provide a hard minibatch in each training iteration:
  - Pool of positive samples: randomly draw samples
  - Pool of negative samples: randomly draw samples, then select the ones with the highest positive scores (hard negatives)
  - Train the CNN with the resulting minibatch
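The negative half of the minibatch construction can be sketched as a single function: draw a random pool, score it with the current network, and keep only the highest-scoring (hardest) negatives. The function name and the pool/batch sizes are illustrative, not from the paper.

```python
# Sketch of hard negative mining for one training iteration.
import numpy as np

def hard_negatives(neg_pool, score_fn, pool_size=1024, batch_size=96, rng=None):
    """Randomly draw up to pool_size negatives, then return the batch_size
    samples with the highest positive scores under the current model."""
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(neg_pool), size=min(pool_size, len(neg_pool)),
                     replace=False)
    scores = np.array([score_fn(neg_pool[i]) for i in idx])
    hardest = idx[np.argsort(scores)[::-1][:batch_size]]  # highest scores first
    return [neg_pool[i] for i in hardest]
```

Because easy negatives (low positive score) contribute almost no gradient, filtering the pool this way concentrates each update on the background patches the classifier currently confuses with the target.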
Hard Negative Mining
- Examples of positive and negative samples as training proceeds (1st, 5th, and 30th minibatch)
- The selected negatives become increasingly difficult over training iterations.

Bounding Box Regression
- Improves the localization quality, as in DPM [Felzenszwalb et al. PAMI 10] and R-CNN [Girshick et al. CVPR 14]
- Train a bounding box regression model from the positive samples.
- Adjust the tracking result by bounding box regression.

Results on OTB100 [Wu15]
- Distance precision and overlap success rate by One-Pass Evaluation (OPE)

Results on VOT2015
- Protocol: MDNet is trained with 58 sequences from {VOT2013, VOT2014, VOT2015}, excluding the sequences in {OTB100}.
- Comparison of the ground truth with our results over repeated runs

[Wu15] Y. Wu, J. Lim, M.-H. Yang: Object Tracking Benchmark, TPAMI 2015
Semantic Segmentation
- Segmenting an image into regions based on their semantic meaning
- Given an input image, obtain a pixel-wise segmentation mask using a deep convolutional neural network (CNN)

Semantic Segmentation by Fully Convolutional Network

Semantic Segmentation using CNN
- Image classification: query image → ... → pool5 (7×7×512) → fc6 → fc7 → class scores
- Fully Convolutional Network (FCN)
  - Interprets fully connected layers as convolution layers: each fully connected layer is identical to a convolution layer with a large spatial filter that covers the entire input field.
  - For a larger input field, pool5 grows to 22×22×512 and the converted fc6/fc7 layers produce 16×16 spatial maps of scores instead of a single vector.
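The fc-to-convolution reinterpretation above can be made concrete with a small sketch: the fc6 weights over a 7×7×512 pool5 map are reshaped into a 7×7 convolution kernel, so sliding the same weights over a 22×22×512 map yields a 16×16 score map (22 − 7 + 1 = 16). The pure-numpy convolution below is for clarity only and assumes stride 1, valid padding.

```python
# Sketch: a fully connected layer applied as a convolution.
import numpy as np

def fc_as_conv(feat, W):
    """feat: (H, W, C) feature map; W: fc weights reshaped to (kh, kw, C, out).
    Valid convolution with stride 1; at kh x kw input size this reduces
    to exactly the original fully connected layer."""
    H, Wd, C = feat.shape
    kh, kw, _, out = W.shape
    oh, ow = H - kh + 1, Wd - kw + 1
    y = np.empty((oh, ow, out))
    for i in range(oh):
        for j in range(ow):
            # Contract the (kh, kw, C) window against the kernel.
            y[i, j] = np.tensordot(feat[i:i+kh, j:j+kw], W, axes=3)
    return y
```

On a 7×7 input the output is 1×1 and numerically identical to the fc layer; on a larger input the same weights produce a dense grid of class scores, which is what turns a classifier into a (coarse) segmenter.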
FCN for Semantic Segmentation
- Network architecture [Long15]: end-to-end CNN architecture for semantic segmentation
- Interpret fully connected layers as convolutional layers.
- 500×500×3 input → coarse score map (16×16×21) → deconvolution → pixel-wise segmentation
- Convolutional layers are pretrained on ImageNet and fine-tuned with segmentation ground truth.

Deconvolution Filter
- How does this deconvolution work? The deconvolution layer is fixed to a 64×64 bilinear interpolation filter.
- The same filter is used for every class: no filter learning!

[Long15] J. Long, E. Shelhamer, and T. Darrell: Fully Convolutional Networks for Semantic Segmentation, CVPR 2015

Skip Architecture
- Ensemble of three different scales
- Combining complementary features: more semantic (deep layers) and more detailed (shallow layers)

Limitations of FCN-based Semantic Segmentation
- Coarse output score map
  - A single bilinear filter must handle the variations of all object classes.
  - Difficult to capture the detailed structure of objects in the image
- Fixed-size receptive field
  - Unable to handle multiple scales
  - Difficult to delineate objects that are too small or too large compared to the size of the receptive field
- Noisy predictions due to the skip architecture
  - Trade-off between details and noise
  - Minor quantitative performance improvement
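The fixed bilinear deconvolution filter mentioned above is easy to construct explicitly; the sketch below builds the standard 2D bilinear kernel (a 64×64 kernel corresponds to 32× upsampling). The function name is illustrative.

```python
# Sketch: the bilinear interpolation kernel used to fix (or initialize)
# the deconvolution/upsampling layer.
import numpy as np

def bilinear_kernel(size):
    """2D bilinear upsampling kernel of shape (size, size): the outer
    product of two triangular (linear interpolation) profiles."""
    factor = (size + 1) // 2
    # Kernel center: integer position for odd sizes, half-integer for even.
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    rows, cols = np.ogrid[:size, :size]
    return ((1 - abs(rows - center) / factor) *
            (1 - abs(cols - center) / factor))
```

Applying this kernel as a transposed convolution with stride `size // 2` reproduces ordinary bilinear interpolation, which is why the same class-agnostic filter can be shared by all 21 output channels; the slide's first limitation follows directly from that sharing.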
Results and Limitations
- Qualitative comparison on example images: input image, ground truth (GT), FCN-32s, FCN-16s, FCN-8s