Convolutional Feature Maps

Similar documents
Lecture 6: CNNs for Detection, Tracking, and Segmentation Object Detection

Module 5. Deep Convnets for Local Recognition Joost van de Weijer 4 April 2016

Fast R-CNN. Author: Ross Girshick Speaker: Charlie Liu Date: Oct, 13 th. Girshick, R. (2015). Fast R-CNN. arxiv preprint arxiv:

Fast R-CNN Object detection with Caffe

Lecture 6: Classification & Localization. boris. ginzburg@intel.com

Bert Huang Department of Computer Science Virginia Tech

Image and Video Understanding

Deformable Part Models with CNN Features

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

CS 1699: Intro to Computer Vision. Deep Learning. Prof. Adriana Kovashka University of Pittsburgh December 1, 2015

Pedestrian Detection with RCNN

Compacting ConvNets for end to end Learning

R-CNN minus R. 1 Introduction. Karel Lenc Department of Engineering Science, University of Oxford, Oxford, UK.

CAP 6412 Advanced Computer Vision

Pedestrian Detection using R-CNN

arxiv: v1 [cs.cv] 29 Apr 2016

Deep Residual Networks

Image Classification for Dogs and Cats

Learning and transferring mid-level image representions using convolutional neural networks

Semantic Recognition: Object Detection and Scene Segmentation

Latest Advances in Deep Learning. Yao Chou

Administrivia. Traditional Recognition Approach. Overview. CMPSCI 370: Intro. to Computer Vision Deep learning

Automatic 3D Reconstruction via Object Detection and 3D Transformable Model Matching CS 269 Class Project Report

MulticoreWare. Global Company, 250+ employees HQ = Sunnyvale, CA Other locations: US, China, India, Taiwan

arxiv: v2 [cs.cv] 27 Sep 2015

Object Detection in Video using Faster R-CNN

Local features and matching. Image classification & object localization

Recognition. Sanja Fidler CSC420: Intro to Image Understanding 1 / 28

Do Convnets Learn Correspondence?

arxiv: v6 [cs.cv] 10 Apr 2015

Segmentation as Selective Search for Object Recognition

SSD: Single Shot MultiBox Detector

Steven C.H. Hoi School of Information Systems Singapore Management University

Multi-view Face Detection Using Deep Convolutional Neural Networks

arxiv: v1 [cs.cv] 6 Feb 2015

Exploit All the Layers: Fast and Accurate CNN Object Detector with Scale Dependent Pooling and Cascaded Rejection Classifiers

Recognizing Cats and Dogs with Shape and Appearance based Models. Group Member: Chu Wang, Landu Jiang

Return of the Devil in the Details: Delving Deep into Convolutional Nets

Task-driven Progressive Part Localization for Fine-grained Recognition

arxiv: v2 [cs.cv] 9 Mar 2016

Edge Boxes: Locating Object Proposals from Edges

Weakly Supervised Object Boundaries Supplementary material

RECOGNIZING objects and localizing them in images is

HE Shuncheng March 20, 2016

Semantic Image Segmentation and Web-Supervised Visual Learning

Large Scale Semi-supervised Object Detection using Visual and Semantic Knowledge Transfer

Understanding Deep Image Representations by Inverting Them

SEMANTIC CONTEXT AND DEPTH-AWARE OBJECT PROPOSAL GENERATION

The Delicate Art of Flower Classification

Scalable Object Detection by Filter Compression with Regularized Sparse Coding

Going Deeper with Convolutional Neural Network for Intelligent Transportation

Conditional Random Fields as Recurrent Neural Networks

SIGNAL INTERPRETATION

VideoStory A New Mul1media Embedding for Few- Example Recogni1on and Transla1on of Events

Weakly Supervised Fine-Grained Categorization with Part-Based Image Representation

The Visual Internet of Things System Based on Depth Camera

Getting Started with Caffe Julien Demouth, Senior Engineer

Convolutional Neural Networks with Intra-layer Recurrent Connections for Scene Labeling

Fast Accurate Fish Detection and Recognition of Underwater Images with Fast R-CNN

3D Model based Object Class Detection in An Arbitrary View

Applications of Deep Learning to the GEOINT mission. June 2015

CNN Based Object Detection in Large Video Images. WangTao, IQIYI ltd

Object Categorization using Co-Occurrence, Location and Appearance

Online Tracking by Learning Discriminative Saliency Map with Convolutional Neural Network

Group Sparse Coding. Fernando Pereira Google Mountain View, CA Dennis Strelow Google Mountain View, CA

Object Detection from Video Tubelets with Convolutional Neural Networks

siftservice.com - Turning a Computer Vision algorithm into a World Wide Web Service

Taking Inverse Graphics Seriously

arxiv:submit/ [cs.cv] 13 Apr 2016

Tattoo Detection for Soft Biometric De-Identification Based on Convolutional NeuralNetworks

High Level Describable Attributes for Predicting Aesthetics and Interestingness

Taking a Deeper Look at Pedestrians

Cees Snoek. Machine. Humans. Multimedia Archives. Euvision Technologies The Netherlands. University of Amsterdam The Netherlands. Tree.

Applying Deep Learning to Car Data Logging (CDL) and Driver Assessor (DA) October 22-Oct-15

Marr Revisited: 2D-3D Alignment via Surface Normal Prediction

Keypoint Density-based Region Proposal for Fine-Grained Object Detection and Classification using Regions with Convolutional Neural Network Features

Object class recognition using unsupervised scale-invariant learning

Learning Object Class Detectors from Weakly Annotated Video

Lecture 2: The SVM classifier

Classification using Intersection Kernel Support Vector Machines is Efficient

PANDA: Pose Aligned Networks for Deep Attribute Modeling

Stochastic Pooling for Regularization of Deep Convolutional Neural Networks

arxiv: v1 [cs.cv] 18 May 2015

TouchPaper - An Augmented Reality Application with Cloud-Based Image Recognition Service

Image Super-Resolution Using Deep Convolutional Networks

Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite

Big Data: Image & Video Analytics

Learning to Process Natural Language in Big Data Environment

CIKM 2015 Melbourne Australia Oct. 22, 2015 Building a Better Connected World with Data Mining and Artificial Intelligence Technologies

Locality-constrained Linear Coding for Image Classification

An Analysis of Single-Layer Networks in Unsupervised Feature Learning

MVA ENS Cachan. Lecture 2: Logistic regression & intro to MIL Iasonas Kokkinos Iasonas.kokkinos@ecp.fr

EdVidParse: Detecting People and Content in Educational Videos

Image Captioning A survey of recent deep-learning approaches

Multi-Scale Improves Boundary Detection in Natural Images

Practical Tour of Visual tracking. David Fleet and Allan Jepson January, 2006

Learning Motion Categories using both Semantic and Structural Information

An Empirical Evaluation of Current Convolutional Architectures Ability to Manage Nuisance Location and Scale Variability

arxiv: v2 [cs.cv] 19 Apr 2014

Two-Stream Convolutional Networks for Action Recognition in Videos

Transcription:

Convolutional Feature Maps Elements of efficient (and accurate) CNN-based object detection Kaiming He Microsoft Research Asia (MSRA) ICCV 2015 Tutorial on Tools for Efficient Object Detection

Overview of this section Quick introduction to convolutional feature maps Intuitions: into the black boxes How object detection networks & region proposal networks are designed Bridging the gap between hand-engineered and deep learning systems Focusing on forward propagation (inference) Backward propagation (training) covered by Ross s section

Object Detection = What, and Where Localization Where? person : 0.992 Recognition What? car : 1.000 horse : 0.993 dog : 0.997 person : 0.979 We need a building block that tells us what and where

Object Detection = What, and Where Convolutional: sliding-window operations Feature: encoding what (and implicitly encoding where ) Map: explicitly encoding where

Convolutional Layers Convolutional layers are locally connected a filter/kernel/window slides on the image or the previous map the position of the filter explicitly provides information for localizing local spatial information w.r.t. the window is encoded in the channels

Convolutional Layers Convolutional layers share weights spatially: translation-invariant Translation-invariant: a translated region will produce the same response at the correspondingly translated position A local pattern s convolutional response can be re-used by different candidate regions

Convolutional Layers Convolutional layers can be applied to images of any sizes, yielding proportionally-sized outputs W H W H W 2 H 2 W 4 H 4

HOG by Convolutional Layers Steps of computing HOG: Computing image gradients Binning gradients into 18 directions Computing cell histograms Normalizing cell histograms Convolutional perspectives: Horizontal/vertical edge filters Directional filters + gating (non-linearity) Sum/average pooling Local response normalization (LRN) see [Mahendran & Vedaldi, CVPR 2015] HOG, dense SIFT, and many other hand-engineered features are convolutional feature maps. Aravindh Mahendran & Andrea Vedaldi. Understanding Deep Image Representations by Inverting Them. CVPR 2015

Feature Maps = features and their locations Convolutional: sliding-window operations Feature: encoding what (and implicitly encoding where ) Map: explicitly encoding where

Feature Maps = features and their locations ImageNet images with strongest responses of this channel one feature map of conv 5 (#55 in 256 channels of a model trained on ImageNet) Intuition of this response: There is a circle-shaped object (likely a tire) at this position. What Where Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. ECCV 2014.

Feature Maps = features and their locations ImageNet images with strongest responses of this channel one feature map of conv 5 (#66 in 256 channels of a model trained on ImageNet) Intuition of this response: There is a λ-shaped object (likely an underarm) at this position. What Where Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. ECCV 2014.

Feature Maps = features and their locations Visualizing one response (by Zeiler and Fergus)? image a feature map keep one response (e.g., the strongest) Matthew D. Zeiler & Rob Fergus. Visualizing and Understanding Convolutional Networks. ECCV 2014.

Feature Maps = features and their locations Visualizing one response conv3 image credit: Zeiler & Fergus Matthew D. Zeiler & Rob Fergus. Visualizing and Understanding Convolutional Networks. ECCV 2014.

Feature Maps = features and their locations Visualizing one response Intuition of this visualization: There is a dog-head shape at this position. Location of a feature: explicitly represents where it is. Responses of a feature: encode what it is, and implicitly encode finer position information finer position information is encoded in the channel dimensions (e.g., bbox regression from responses at one pixel as in RPN) conv5 image credit: Zeiler & Fergus Matthew D. Zeiler & Rob Fergus. Visualizing and Understanding Convolutional Networks. ECCV 2014.

Receptive Field Receptive field of the first layer is the filter size Receptive field (w.r.t. input image) of a deeper layer depends on all previous layers filter size and strides Correspondence between a feature map pixel and an image pixel is not unique Map a feature map pixel to the center of the receptive field on the image in the SPP-net paper Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. ECCV 2014.

Receptive Field How to compute the center of the receptive field A simple solution For each layer, pad F/2 pixels for a filter size F (e.g., pad 1 pixel for a filter size of 3) On each feature map, the response at (0, 0) has a receptive field centered at (0, 0) on the image On each feature map, the response at (x, y) has a receptive field centered at (Sx, Sy) on the image (stride S) A general solution See [Karel Lenc & Andrea Vedaldi] R-CNN minus R. BMVC 2015. Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. ECCV 2014.

Region-based CNN Features warped region CNN aeroplane? no... person? yes... tvmonitor? no. figure credit: R. Girshick et al. input image region proposals ~2,000 1 CNN for each region classify regions R-CNN pipeline R. Girshick, J. Donahue, T. Darrell, & J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR 2014.

Region-based CNN Features Given proposal regions, what we need is a feature for each region R-CNN: cropping an image region + CNN on region, requires 2000 CNN computations What about cropping feature map regions?

Regions on Feature Maps image region feature map region Compute convolutional feature maps on the entire image only once. Project an image region to a feature map region (using correspondence of the receptive field center) Extract a region-based feature from the feature map region Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. ECCV 2014.

Regions on Feature Maps? warp image region feature map region Fixed-length features are required by fully-connected layers or SVM But how to produce a fixed-length feature from a feature map region? Solutions in traditional compute vision: Bag-of-words, SPM Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. ECCV 2014.

Bag-of-words & Spatial Pyramid Matching SIFT/HOG-based feature maps + level 0 + + + + + + + level 1 + + + + + + + + level 2 + + + + + + + + + + + + + + + + + + + + pooling pooling pooling + + + Bag-of-words [J. Sivic & A. Zisserman, ICCV 2003] Spatial Pyramid Matching (SPM) [K. Grauman & T. Darrell, ICCV 2005] [S. Lazebnik et al, CVPR 2006] figure credit: S. Lazebnik et al. Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. ECCV 2014.

Spatial Pyramid Pooling (SPP) Layer fix the number of bins (instead of filter sizes) adaptively-sized bins a finer level maintains explicit spatial information pooling concatenate, fc layers a coarser level removes explicit spatial information (bag-of-features) Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. ECCV 2014.

Spatial Pyramid Pooling (SPP) Layer Pre-trained nets often have a single-resolution pooling layer (7x7 for VGG nets) To adapt to a pre-trained net, a single-level pyramid is useable Region-of-Interest (RoI) pooling [R. Girshick, ICCV 2015] pooling concatenate, fc layers Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. ECCV 2014.

Single-scale and Multi-scale Feature Maps Feature Pyramid Resize the input image to multiple scales Compute feature maps for each scale Used for HOG/SIFT features and convolutional features (OverFeat [Sermanet et al. 2013]) feature pyramid image pyramid Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. ECCV 2014.

Single-scale and Multi-scale Feature Maps But deep convolutional feature maps perform well at a single scale SPP-net 1-scale SPP-net 5-scale pool 5 43.0 44.9 fc 6 42.5 44.8 fine-tuned fc 6 52.3 53.7 fine-tuned fc 7 54.5 55.2 fine-tuned fc 7 bbox reg 58.0 59.2 conv time 0.053s 0.293s fc time 0.089s 0.089s total time 0.142s 0.382s detection map on PASCAL VOC 2007, with ZF-net pre-trained on ImageNet this table is from [K. He, et al. 2014] Also observed in Fast R-CNN and VGG nets Good speed-vs-accuracy tradeoff Learn to be scale-invariant from pretraining data (ImageNet) (note: but if good accuracy is desired, feature pyramids are still needed) Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. ECCV 2014.

R-CNN vs. Fast R-CNN (forward pipeline) feature feature feature feature feature feature feature SPP/RoI pooling CNN CNN CNN CNN CNN image image R-CNN Extract image regions 1 CNN per region (2000 CNNs) Classify region-based features SPP-net & Fast R-CNN (the same forward pipeline) 1 CNN on the entire image Extract features from feature map regions Classify region-based features Ross Girshick. Fast R-CNN. ICCV 2015. Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. ECCV 2014.

R-CNN vs. Fast R-CNN (forward pipeline) feature feature feature feature feature feature feature SPP/RoI pooling CNN CNN CNN CNN CNN image image R-CNN Complexity: ~224 224 2000 SPP-net & Fast R-CNN (the same forward pipeline) Complexity: ~600 1000 1 ~160x faster than R-CNN Ross Girshick. Fast R-CNN. ICCV 2015. Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. ECCV 2014.

Region Proposal from Feature Maps Object detection networks are fast (0.2s) but what about region proposal? Selective Search [Uijlings et al. ICCV 2011]: 2s per image EdgeBoxes [Zitnick & Dollar. ECCV 2014]: 0.2s per image Can we do region proposal on the same set of feature maps? Shaoqing Ren, Kaiming He, Ross Girshick, & Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NIPS 2015.

Feature Maps = features and their locations Convolutional: sliding-window operations Feature: encoding what (and implicitly encoding where ) Map: explicitly encoding where

Region Proposal from Feature Maps By decoding one response at a single pixel, we can still roughly see the object outline * Revisiting visualizations from Zeiler & Fergus Finer localization information has been encoded in the channels of a convolutional feature response Extract this information for better localization * Zeiler & Fergus s method traces unpooling information so the visualization involves more than a single response. But other visualization methods reveal similar patterns. Shaoqing Ren, Kaiming He, Ross Girshick, & Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NIPS 2015.

Region Proposal from Feature Maps a feature vector (e.g., 256-d) The channels of this feature vector encodes finer localization information image feature map The spatial position of this feature provides coarse locations Shaoqing Ren, Kaiming He, Ross Girshick, & Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NIPS 2015.

Region Proposal Network Slide a small window on the feature map Build a small network for: classifying object or not-object, and regressing bbox locations classify obj./not-obj. regress box locations Position of the sliding window provides localization information with reference to the image Box regression provides finer localization information with reference to this sliding window sliding window convolutional feature map Shaoqing Ren, Kaiming He, Ross Girshick, & Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NIPS 2015.

Anchors as references Anchors: pre-defined reference boxes Box regression is with reference to anchors: regressing an anchor box to a ground-truth box n scores 4n coordinates n anchors Object probability is with reference to anchors, e.g.: anchors as positive samples: if IoU > 0.7 or IoU is max anchors as negative samples: if IoU < 0.3 256-d Shaoqing Ren, Kaiming He, Ross Girshick, & Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NIPS 2015.

Anchors as references Anchors: pre-defined reference boxes n scores 4n coordinates n anchors Translation-invariant anchors: the same set of anchors are used at each sliding position the same prediction functions (with reference to the sliding window) are used a translated object will have a translated prediction 256-d Shaoqing Ren, Kaiming He, Ross Girshick, & Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NIPS 2015.

Anchors as references Anchors: pre-defined reference boxes n scores 4n coordinates n anchors Multi-scale/size anchors: multiple anchors are used at each position: e.g., 3 scales (128 2, 256 2, 512 2 ) and 3 aspect ratios (2:1, 1:1, 1:2) yield 9 anchors each anchor has its own prediction function single-scale features, multi-scale predictions 256-d Shaoqing Ren, Kaiming He, Ross Girshick, & Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NIPS 2015.

Anchors as references Comparisons of multi-scale strategies feature image Image/Feature Pyramid Filter Pyramid Anchor Pyramid Shaoqing Ren, Kaiming He, Ross Girshick, & Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NIPS 2015.

Region Proposal Network RPN is fully convolutional [Long et al. 2015] n scores 4n coordinates n anchors RPN is trained end-to-end 256-d RPN shares convolutional feature maps with the detection network (covered in Ross s section) Shaoqing Ren, Kaiming He, Ross Girshick, & Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NIPS 2015.

detector Faster R-CNN RoI pooling proposals RPN feature map system time 07 data 07+12 data R-CNN ~50s 66.0 - Fast R-CNN ~2s 66.9 70.0 Faster R-CNN 198ms 69.9 73.2 detection map on PASCAL VOC 2007, with VGG-16 pre-trained on ImageNet CNN image Shaoqing Ren, Kaiming He, Ross Girshick, & Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NIPS 2015.

Example detection results of Faster R-CNN car : 1.000 person : 0.992 horse : 0.993 person : 0.974 dog : 0.983 cat : 0.928 dog : 0.989 bus: 0.980 boat : 0.853 person : 0.993 person : 0.753 person : 0.972 person : 0.981 person : 0.907 Shaoqing Ren, Kaiming He, Ross Girshick, & Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NIPS 2015.

Keys to efficient CNN-based object detection Feature sharing R-CNN => SPP-net & Fast R-CNN: sharing features among proposal regions Fast R-CNN => Faster R-CNN: sharing features between proposal and detection All are done by shared convolutional feature maps Efficient multi-scale solutions Single-scale convolutional feature maps are good trade-offs Multi-scale anchors are fast and flexible

Conclusion of this section Quick introduction to convolutional feature maps Intuitions: into the black boxes How object detection networks & region proposal networks are designed Bridging the gap between hand-engineered and deep learning systems Focusing on forward propagation (inference) Backward propagation (training) covered by Ross s section