Module 5. Deep Convnets for Local Recognition Joost van de Weijer 4 April 2016


Previously, end-to-end: an input image is mapped to a label (e.g. Dog) through a learned representation trained for Task A (e.g. image classification). Part I: End-to-end learning (E2E). Slide credit: Jose M

Previously, fine-tuning: the representation learned end-to-end on Domain A (Part I: End-to-end learning, E2E) is transferred and fine-tuned on Domain B, giving a fine-tuned learned representation (Part I: End-to-End Fine-Tuning, FT). Slide credit: X. Giro

Previously, fine-tuning: fine-tuning a pre-trained network. Use a high learning rate in the new layer and a low learning rate in all other layers. Slide credit: Victor Campos, Layer-wise CNN surgery for Visual Sentiment Prediction (ETSETB 2015)
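The learning-rate rule above can be sketched as a plain SGD step with one rate per layer. This is only an illustration: the layer names, shapes and rate values below are hypothetical and do not come from the slides.

```python
import numpy as np

# Hypothetical parameters: two pre-trained layers and one newly added layer.
params = {
    "conv1": np.ones((3, 3)),
    "fc7": np.ones((4, 4)),
    "fc8_new": np.zeros((4, 2)),  # new layer for the target task
}

# Assumed values: low rate for pre-trained layers, high rate for the new one.
learning_rates = {"conv1": 1e-4, "fc7": 1e-4, "fc8_new": 1e-2}


def sgd_step(params, grads, learning_rates):
    """Return updated parameters using a separate learning rate per layer."""
    return {name: params[name] - learning_rates[name] * grads[name]
            for name in params}


# Dummy gradients of ones, only to show the different step sizes.
grads = {name: np.ones_like(value) for name, value in params.items()}
updated = sgd_step(params, grads, learning_rates)
```

With identical gradients, the new layer moves 100 times further than the pre-trained layers, which is the intended effect of the two rates.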

Previously, off-the-shelf features: the representation learned for Task A (e.g. image classification) in Part I: End-to-end learning (E2E) is reused in Part II: Off-the-shelf features for Task B (e.g. image retrieval). Slide credit: X. Giro

Previously, off-the-shelf features: image classification takes an image as input and gives a label as output (e.g. Orange). Image representations: spatially coded (like spatial pyramids) or orderless (like BOW).

Two deep lectures in M5, Deep ConvNets for Recognition at: Global Scale (today's lecture) and Local Scale (next lecture).

Image Classification: image as an input, label as output (e.g. Orange). How to process non-square images? Options: resize, zero padding, largest centred square.
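Two of the options above, zero padding and the largest centred square, can be sketched in a few lines of NumPy. The function names are made up for this example, and resizing is left out because it needs an interpolation routine.

```python
import numpy as np


def largest_centred_square(image):
    """Crop the largest centred square from an H x W x C image."""
    h, w = image.shape[:2]
    side = min(h, w)
    top = (h - side) // 2
    left = (w - side) // 2
    return image[top:top + side, left:left + side]


def zero_pad_to_square(image):
    """Pad the shorter side with zeros so the image becomes square."""
    h, w = image.shape[:2]
    side = max(h, w)
    padded = np.zeros((side, side) + image.shape[2:], dtype=image.dtype)
    top = (side - h) // 2
    left = (side - w) // 2
    padded[top:top + h, left:left + w] = image
    return padded


image = np.ones((4, 6, 3), dtype=np.uint8)   # toy 4 x 6 RGB image
cropped = largest_centred_square(image)       # shape (4, 4, 3)
padded = zero_pad_to_square(image)            # shape (6, 6, 3)
```

Cropping discards content at the borders, while padding keeps everything but adds empty regions; which one works better depends on the data.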

Local object recognition: object localization (single object), object detection, semantic segmentation.

Classification + localization. Slide credit: Li, Karpathy, Johnson

Localization as regression: the network has a classification head and a regression head. Problem: multiple classes. The classification head outputs C class scores; the regression head outputs Cx4 numbers (one bounding box per class). Slide credit: Li, Karpathy, Johnson
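A minimal sketch of the two heads, assuming a shared feature vector and linear heads. The sizes are hypothetical and the random weights are stand-ins for trained ones; the point is only the C scores, the Cx4 box numbers, and how the box of the predicted class is selected.

```python
import numpy as np

rng = np.random.default_rng(0)

num_classes = 3          # C, hypothetical
feature_dim = 8          # size of the shared feature vector, hypothetical

# Randomly initialised heads, stand-ins for trained weights.
W_cls = rng.normal(size=(feature_dim, num_classes))
W_reg = rng.normal(size=(feature_dim, num_classes * 4))


def localize(features):
    """Return (predicted class, box of that class) for one feature vector."""
    scores = features @ W_cls                            # C class scores
    boxes = (features @ W_reg).reshape(num_classes, 4)   # C x 4 numbers
    best = int(np.argmax(scores))
    return best, boxes[best]


features = rng.normal(size=feature_dim)
label, box = localize(features)
```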

Localization as regression (example): localization of clothes. Regression is done in two steps: first the person bounding box and then the clothing bounding boxes (master project 2015). Esteve Cervantes: Evaluating deep features for Fashion Recognition

Local object recognition: object localization (single object), object detection (any ideas?), semantic segmentation.

Sliding window (227x227 windows): apply classification + regression at each window position (example scores: 0.03, 0.83). Compute a new regressed bounding box and classification score for all sliding window positions.

Sliding window (227x227 windows, example scores: 0.83, 0.99): repeat for different scales and combine all results (e.g. with non-maximum suppression).
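Non-maximum suppression is not spelled out on the slide. A common greedy variant can be sketched as follows; the IoU threshold of 0.5, the box format (x1, y1, x2, y2) and the toy boxes are assumptions for this example.

```python
import numpy as np


def iou(box, boxes):
    """Intersection over union of one box with an array of boxes (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)


def non_maximum_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop boxes that overlap it too much, repeat."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]
    return keep


boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.83, 0.99, 0.03])
kept = non_maximum_suppression(boxes, scores)
```

Here the first two boxes overlap strongly, so only the higher-scoring one survives, together with the isolated third box. In practice a score threshold would usually remove low-confidence boxes such as the third one before or after this step.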

Sliding window (efficient computation). Let us for simplicity consider a simple three-layer network for car/not car classification on a 10x10 window:
input (10x10x3) -> conv1, 5 filters of 5x5x3 (output 6x6x5) -> fc1 (output 1x1x10) -> fc2 (2 scores: car/not car)
What are the spatial coordinates of conv1? How many 10x10 windows are there in a 12x17 image? Part of the convolutional features are the same for neighbouring windows and do not need recomputation: the convolutions can be computed in a single pass over the whole image.
Written as a fully convolutional network on the 12x17 image:
conv1, 5 filters of 5x5x3: 12x17x3 -> 8x13x5
fc1 = conv2, 10 filters of 6x6x5: 8x13x5 -> 3x8x10
fc2 = conv3, 2 filters of 1x1x10: 3x8x10 -> 3x8x2
We have the 8x3=24 classification scores, sharing computation of the convolutional features.
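The 12x17 example can be checked numerically. The sketch below uses a naive NumPy convolution with random filters and compares one fully convolutional pass with running the network separately on each of the 24 windows. The ReLU between layers is an assumption; the slides do not name the non-linearity.

```python
import numpy as np


def conv_valid(x, filters):
    """Valid convolution (no padding, stride 1).

    x: H x W x C input, filters: K x kh x kw x C, returns H' x W' x K.
    """
    k, kh, kw, _ = filters.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((out_h, out_w, k))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i:i + kh, j:j + kw, :]
            out[i, j] = np.tensordot(filters, patch, axes=3)
    return out


def network(x, conv1, conv2, conv3):
    """conv1 -> ReLU -> conv2 (was fc1) -> ReLU -> conv3 (was fc2)."""
    h = np.maximum(conv_valid(x, conv1), 0)
    h = np.maximum(conv_valid(h, conv2), 0)
    return conv_valid(h, conv3)


rng = np.random.default_rng(0)
conv1 = rng.normal(size=(5, 5, 5, 3))     # 5 filters of 5x5x3
conv2 = rng.normal(size=(10, 6, 6, 5))    # 10 filters of 6x6x5
conv3 = rng.normal(size=(2, 1, 1, 10))    # 2 filters of 1x1x10

image = rng.normal(size=(12, 17, 3))
dense_scores = network(image, conv1, conv2, conv3)   # shape (3, 8, 2)

# Same result as running the network on every 10x10 window separately.
window_scores = np.zeros_like(dense_scores)
for i in range(3):
    for j in range(8):
        window = image[i:i + 10, j:j + 10, :]
        window_scores[i, j] = network(window, conv1, conv2, conv3)[0, 0]
```

The number of windows follows from (12 - 10 + 1) x (17 - 10 + 1) = 3 x 8 = 24, and the dense pass computes conv1 once instead of 24 times.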

Sliding window (efficient computation). Networks can be written as fully convolutional networks to speed up computation at test time. Example: bear and fish detection on multiple scales. Sermanet et al., Integrated Recognition, Localization and Detection using Convolutional Networks, ICLR 2014

object proposals: object proposal methods compute boxes which potentially contain an object. Features for each box are extracted and a classifier is applied. Typically there are thousands of boxes (but far fewer than with a sliding window). Many different approaches exist: selective search, edge boxes, GOP, etc. Selective search: K. Van de Sande et al. Segmentation as selective search for object recognition. ICCV 2011.

object proposals (RCNN): 1. compute object proposals (~2k) 2. warp dilated bounding box 3. compute CNN features 4. classify regions (car: yes, person: no), followed by bounding box regression. Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." CVPR 2014.

object proposals (RCNN): start from AlexNet, remove the last layer and fine-tune for the 20 PASCAL classes. Use the 4096-d fc7 vector as the description of the bounding box. Train an SVM on this representation for classification. Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." CVPR 2014.

object proposals (RCNN) slide credit: Girshick

object proposals (RCNN)

object proposals (RCNN) slide credit: Li, Karpathy, Johnson

object proposals (RCNN) drawbacks: not end-to-end; warping of boxes; lots of double computation (overlap of bounding boxes). Pipeline: 1. compute object proposals (~2k) 2. warp dilated bounding box 3. compute CNN features 4. classify regions (car: yes, person: no), giving an improved bounding box. Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." CVPR 2014.

object proposals (Fast R-CNN)

object proposals (Fast R-CNN): shared computation (conv1-conv5). Compute the convolutional features once per image, then extract features from conv5 for all bounding boxes. This was first proposed by: He, Kaiming, et al. "Spatial pyramid pooling in deep convolutional networks for visual recognition." PAMI 2015

object proposals (Fast R-CNN): shared computation. For all bounding boxes, apply Region of Interest pooling (ROI pooling): pool the features in a spatial grid.
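ROI pooling can be sketched as max pooling over a fixed grid laid on top of the region. This is a simplified illustration, assuming the ROI is given in feature-map coordinates and is at least as large as the grid; the 2x2 grid and the function name are choices made for this example.

```python
import numpy as np


def roi_max_pool(feature_map, roi, grid=(2, 2)):
    """Max-pool the features inside roi into a fixed grid.

    feature_map: H x W x C array (e.g. conv5 features).
    roi: (y1, x1, y2, x2) in feature-map coordinates, end-exclusive.
    """
    y1, x1, y2, x2 = roi
    region = feature_map[y1:y2, x1:x2]
    # Split rows and columns into grid cells of (nearly) equal size.
    row_edges = np.linspace(0, region.shape[0], grid[0] + 1).astype(int)
    col_edges = np.linspace(0, region.shape[1], grid[1] + 1).astype(int)
    pooled = np.zeros(grid + (feature_map.shape[2],))
    for i in range(grid[0]):
        for j in range(grid[1]):
            cell = region[row_edges[i]:row_edges[i + 1],
                          col_edges[j]:col_edges[j + 1]]
            pooled[i, j] = cell.max(axis=(0, 1))
    return pooled


feature_map = np.arange(36, dtype=float).reshape(6, 6, 1)
pooled = roi_max_pool(feature_map, roi=(0, 0, 4, 6))
```

Whatever the size of the ROI, the output has the same grid shape, so it can be fed to fully connected layers of fixed input size.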

object proposals (Fast R-CNN): shared computation -> ROI pooling (pool the features in a spatial grid) -> FCs -> two outputs: classification (log loss) and regression (smooth L1 loss). End-to-end training.
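The two losses can be sketched as follows, using the smooth L1 definition from the Fast R-CNN paper. The weighting factor, the convention that class 0 is background, and the toy numbers are assumptions for this example.

```python
import numpy as np


def smooth_l1(x):
    """Smooth L1: 0.5 * x^2 if |x| < 1, else |x| - 0.5 (elementwise)."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) < 1, 0.5 * x ** 2, np.abs(x) - 0.5)


def log_loss(probabilities, true_class):
    """Log loss for one ROI: negative log-probability of the true class."""
    return -np.log(probabilities[true_class])


def multi_task_loss(probabilities, true_class, box_pred, box_target, weight=1.0):
    """Classification loss plus weighted box regression loss.

    The regression term is only used for non-background ROIs
    (class 0 is assumed to be background here).
    """
    loss = log_loss(probabilities, true_class)
    if true_class > 0:
        loss += weight * smooth_l1(box_pred - box_target).sum()
    return loss


probabilities = np.array([0.1, 0.7, 0.2])
box_pred = np.array([0.5, 2.0, 0.0, 0.0])
box_target = np.zeros(4)
loss = multi_task_loss(probabilities, 1, box_pred, box_target)
```

Smooth L1 is quadratic for small errors and linear for large ones, which makes the regression less sensitive to outliers than a squared loss.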

object proposals (Fast R-CNN): multi-task training also improves classification performance; end-to-end training improves results.

                   Fast R-CNN   R-CNN
Train time (h)     9.5          84
Train speedup      8.8x         -
Test time/image    0.32s        47s
Test speedup       146x         -
mAP                66.9%        66.0%

Test time does not include object proposal computation (which is now the bottleneck).

object proposals (Faster R-CNN): shared computation (conv5) -> Region Proposal Network (RPN) -> ROI pooling -> FCs. Compute the object proposals directly in the network.

object proposals (Faster R-CNN): slide a window over the feature map. Add a network which classifies and regresses the bounding boxes. The classification score provides the confidence that an object is present. Use N anchors for proposals of varying aspect ratios. Slide credit: Kaiming He
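Anchor generation at one sliding-window position can be sketched as below. The three scales and three aspect ratios are assumed values in the spirit of the Faster R-CNN paper, giving N = 9; the slide itself only states that N anchors of varying aspect ratios are used.

```python
import numpy as np


def make_anchors(center_y, center_x, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return N = len(scales) * len(ratios) anchor boxes (y1, x1, y2, x2).

    Each anchor has area scale**2 and height/width ratio `ratio`.
    """
    anchors = []
    for scale in scales:
        for ratio in ratios:
            height = scale * np.sqrt(ratio)
            width = scale / np.sqrt(ratio)
            anchors.append([center_y - height / 2, center_x - width / 2,
                            center_y + height / 2, center_x + width / 2])
    return np.array(anchors)


anchors = make_anchors(center_y=300, center_x=400)
```

For each anchor the network then predicts an objectness score and four box offsets relative to that anchor.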

object proposals (Faster R-CNN): computation for 1000 boxes.

Model                      Time
Edge boxes + R-CNN         0.25 sec + 1000*ConvTime + 1000*FcTime
Edge boxes + Fast R-CNN    0.25 sec + 1*ConvTime + 1000*FcTime
Faster R-CNN               1*ConvTime + 1000*FcTime

Slide credit: Kaiming He

object proposals (Faster R-CNN) slide credit: Li, Karpathy, Johnson


object localization: winner of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2015 challenge, with residual networks and Faster RCNN.


summary object detection
Object localization: when there is one object, or a known number of objects/classes, you can do object localization by adding a regression head to your network.
Sliding window + CNN can be computed efficiently by writing the network as a fully convolutional network.
Object proposal methods are straightforwardly combined with CNNs, but for fast/good results consider:
  adding a regression head to improve bounding box estimation;
  sharing computation of the convolutional features (SPP);
  end-to-end training of the network (Fast R-CNN);
  including a Region Proposal Network for fast object proposals within the network (Faster R-CNN).
Slide credit: Li, Karpathy, Johnson

Local object recognition: object localization (single object), object detection, semantic segmentation.

semantic segmentation: assign a class to all pixels. instance segmentation: assign pixels to a particular instance of a class (chair1, etc.).

semantic segmentation: a ConvNet predicts the class of the center pixel. Write the network as a fully convolutional network and apply it to the image. Because of the convolutions the output resolution is smaller than the input, and upsampling is required.

semantic segmentation: pixelwise loss. Long et al., Fully Convolutional Networks for Semantic Segmentation, CVPR 2015
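A pixelwise loss can be sketched as a softmax cross-entropy evaluated at every pixel and averaged over the image. The choice of softmax cross-entropy and the averaging are assumptions for this example; the slide only says that the loss is pixelwise.

```python
import numpy as np


def pixelwise_log_loss(scores, labels):
    """Mean softmax cross-entropy over all pixels.

    scores: H x W x C class scores, labels: H x W integer class ids.
    """
    shifted = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    rows, cols = np.indices(labels.shape)
    return -log_probs[rows, cols, labels].mean()


scores = np.zeros((2, 2, 3))          # uniform scores over 3 classes
labels = np.array([[0, 1], [2, 0]])
loss = pixelwise_log_loss(scores, labels)
```

With uniform scores every pixel has probability 1/3 for its true class, so the loss equals log(3).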

semantic segmentation: Convolution (3x3), padding [1 1 1 1], stride [1 1] keeps the input resolution. Convolution (3x3), padding [1 1 1 1], stride [2 2] roughly halves the resolution. Long et al., Fully Convolutional Networks for Semantic Segmentation, CVPR 2015

semantic segmentation: deconvolution (3x3), padding [1 1 1 1], stride [2 2]. Deconvolutions are also called fractionally strided convolutions or transposed convolutions.
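The effect of stride and padding on resolution can be illustrated in one dimension. The sketch below gives the output-size formulas for a convolution and a transposed convolution, plus a naive 1-D transposed convolution; the kernel values are chosen for this example so that the result is a linear interpolation.

```python
import numpy as np


def conv_output_size(n, kernel=3, padding=1, stride=1):
    """Spatial output size of a convolution along one dimension."""
    return (n + 2 * padding - kernel) // stride + 1


def deconv_output_size(n, kernel=3, padding=1, stride=2):
    """Spatial output size of a transposed convolution along one dimension."""
    return (n - 1) * stride - 2 * padding + kernel


def deconv1d(x, kernel, stride=2, padding=1):
    """1-D transposed convolution: each input value adds a scaled copy of
    the kernel to the output, at positions `stride` apart."""
    k = len(kernel)
    full = np.zeros((len(x) - 1) * stride + k)
    for i, value in enumerate(x):
        full[i * stride:i * stride + k] += value * kernel
    return full[padding:len(full) - padding]   # crop the padding


x = np.array([1.0, 2.0, 3.0])
upsampled = deconv1d(x, kernel=np.array([0.5, 1.0, 0.5]))
```

A stride-2 convolution maps 5 samples to 3, and the stride-2 transposed convolution maps 3 samples back to 5. In a network the kernel is learned rather than fixed to interpolation weights.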

semantic segmentation Noh et al. ICCV 2015


semantic segmentation: combine where (local, shallow) with what (global, deep). Long et al., Fully Convolutional Networks for Semantic Segmentation, CVPR 2015

semantic segmentation: skip layers. The coarse deep predictions are interpolated and summed with shallower predictions (interp + sum, twice) to give a dense output. Long et al., Fully Convolutional Networks for Semantic Segmentation, CVPR 2015
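The interp + sum step can be sketched on toy score maps. Nearest-neighbour upsampling is used here as a simplification of the interpolation, and the map sizes, class count and random scores are hypothetical.

```python
import numpy as np


def upsample2x(scores):
    """Nearest-neighbour 2x upsampling of an H x W x C score map."""
    return scores.repeat(2, axis=0).repeat(2, axis=1)


def fuse(deep_scores, shallow_scores):
    """Upsample the deeper (coarser) scores and add the shallower ones."""
    return upsample2x(deep_scores) + shallow_scores


num_classes = 3
scores_stride32 = np.random.rand(2, 2, num_classes)   # deepest, coarsest
scores_stride16 = np.random.rand(4, 4, num_classes)
scores_stride8 = np.random.rand(8, 8, num_classes)    # shallowest, finest

fused16 = fuse(scores_stride32, scores_stride16)      # 1 skip
fused8 = fuse(fused16, scores_stride8)                # 2 skips
labels = fused8.argmax(axis=-1)                       # per-location class
```

Each fusion doubles the resolution of the prediction, so two skips take the output from stride 32 to stride 8 before the final upsampling to image size.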

semantic segmentation results. Columns: input image; stride 32 (no skips); stride 16 (1 skip); stride 8 (2 skips); ground truth. Long et al., Fully Convolutional Networks for Semantic Segmentation, CVPR 2015

semantic segmentation Eigen, Fergus, Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-Scale Convolutional Architecture, ICCV 2015

semantic segmentation Surface normals results Eigen, Fergus, Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-Scale Convolutional Architecture, ICCV 2015

instance segmentation Dai et al. Instance aware Semantic Segmentation via Multi-task Network Cascades, arxiv 2015.


instance segmentation results ground-truth Dai et al. Instance aware Semantic Segmentation via Multi-task Network Cascades, arxiv 2015.

Generative Adversarial Networks: noise -> image. Fractionally strided convolutions (deconvolutions) can be used to generate images.

Generative Adversarial Networks: consider I would like to generate images of horses. My generated horse images G(z) are generated from noise z. I can train a discriminative network D to distinguish real horse images x from generated horse images G(z):
$\max_D \; \log D(x) + \log\left(1 - D(G(z))\right)$
I can then optimize my generative network G to fool the discriminative network:
$\min_G \max_D \; \log D(x) + \log\left(1 - D(G(z))\right)$
You can then re-optimize the discriminative network D, etc., until D gives in. Goodfellow et al., Generative Adversarial Nets, NIPS 2014
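The objective can be evaluated on a batch of discriminator outputs to see what D and G are pulling towards. The function name and the toy output values are made up for this illustration; no actual networks are trained here.

```python
import numpy as np


def value_function(d_real, d_fake):
    """Estimate of log D(x) + log(1 - D(G(z))) averaged over a batch.

    d_real: discriminator outputs D(x) on real images, in (0, 1).
    d_fake: discriminator outputs D(G(z)) on generated images, in (0, 1).
    """
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))


# A confident discriminator gives a value near 0 (its maximum) ...
confident = value_function(np.array([0.99, 0.98]), np.array([0.01, 0.02]))
# ... while a fooled discriminator (D = 0.5 everywhere) gives 2 * log(0.5).
fooled = value_function(np.array([0.5, 0.5]), np.array([0.5, 0.5]))
```

D is updated to push this value up, G is updated to push it down, and the two updates alternate.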

Generative Adversarial Networks: examples of generated bedrooms. Radford et al., Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks, ICLR 2016

Generative Adversarial Networks: interpolation between points in z. Radford et al., Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks, ICLR 2016

summary semantic segmentation
Fully convolutional networks can be applied for efficient classification of all pixels.
To get high-quality segmentations, deep features of multiple scales need to be combined (e.g. with skip layers).
Upsampling can be done by deconvolution and de-pooling operations.
Instance segmentation can be performed by combining object detection and semantic segmentation pipelines.
Slide credit: Li, Karpathy, Johnson