Image and Video Understanding


Image and Video Understanding, 2VO 710.095 WS. Christoph Feichtenhofer, Axel Pinz. Slide credits: many thanks to all the great computer vision researchers on whose work this presentation relies. Most material is taken from tutorials at the NIPS, CVPR and BMVC conferences.

Outline. Convolutional Networks (ConvNets) for image classification: operations in each layer, architecture, training, visualizations. Krizhevsky, A., Sutskever, I. and Hinton, G. E., ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012. M. Zeiler & R. Fergus, Visualizing and Understanding Convolutional Networks, ECCV 2014. A. Mahendran and A. Vedaldi, Understanding Deep Image Representations by Inverting Them, CVPR 2015. State-of-the-art methods and applications of ConvNets: K. Simonyan and A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, ICLR 2015. S. Ioffe and C. Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, ICML 2015. K. He, X. Zhang, S. Ren and J. Sun, Deep Residual Learning for Image Recognition, arXiv 2015.

Goal: Scene Understanding. From a temporal input sequence, recover scenes (desert), objects (car, person), actions (open door, get out of car), and activities (leave a car in the desert).

One application: Image retrieval

Deep Learning - breakthrough in visual and speech recognition Credit: B. Ginzburg

ConvNet Detection Examples

Summary, compared with the SIFT descriptor: image pixels → apply Gabor filters → spatial pool (sum) → normalize to unit length → feature vector. Credit: R. Fergus

What are the weakest links limiting performance? Replacing each component of the deformable part model detector with humans shows that good features (part detection) and accurate localization (NMS) matter most [Parikh & Zitnick CVPR 10].

Typical visual recognition pipeline: select / develop features, add machine learning for multi-class recognition on top, and train a classifier. Image/video pixels → feature extractor (e.g. SIFT, HoG) → trainable classifier (e.g. SVM) → object class.

Intuition behind deep neural nets: build features automatically from training data, combining feature extraction and classification all the way from pixels to classifier. Each box is a feature detector; each layer extracts features from the output of the previous layer. Layer 1 → Layer 2 → Layer 3 → CAR.

Some Key Ingredients for Convolutional Neural Networks

Neural networks trained via backpropagation. F-Prop: propagate the input vector through the hidden layers and non-linearities to the outputs; compare the outputs with the correct answer to get an error signal L. B-Prop: back-propagate the error signal to get derivatives for learning the parameters θ. Training = F-Prop / B-Prop. Learning by stochastic gradient descent (SGD): A) compute the loss L on a small mini-batch of data; B) compute the gradient w.r.t. θ; C) use the gradient to update θ (make a step in the opposite direction). Credit: G. Hinton
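
As a concrete illustration, here is a minimal PyTorch sketch of one F-Prop / B-Prop / SGD step; the network, sizes, and data are illustrative placeholders, not taken from the lecture.

```python
import torch
import torch.nn as nn

# Illustrative two-layer network: linear -> non-linearity -> linear.
model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(32, 784)           # small mini-batch of input vectors
y = torch.randint(0, 10, (32,))    # correct answers

logits = model(x)                  # F-Prop: compute outputs
loss = loss_fn(logits, y)          # A) compare with correct answers: loss L
optimizer.zero_grad()
loss.backward()                    # B) B-Prop: gradient of L w.r.t. parameters θ
optimizer.step()                   # C) step in the opposite direction of the gradient
```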

Neural networks trained via backpropagation, illustrated on Cat and Dog example images: compare the outputs with the correct answer to get an error signal, back-propagate the error signal to get derivatives for the parameters θ, and update θ by SGD on small mini-batches. Credit: G. Hinton

Neural networks at test time: F-Prop only (Cat and Dog examples). Credit: G. Hinton

Summary: Modularity of backpropagation Credit: A. Vedaldi

Credit: M. A. Ranzato Motivation: Images as a composition of local parts Pixel-based representation Too many parameters!

Motivation: Images as a composition of local parts Patch-based representation Credit: M. A. Ranzato

Motivation: Images as a composition of local parts Patch-based representation Credit: M. A. Ranzato Still too many parameters!

Motivation: Images as a composition of local parts Convolution example Credit: M. A. Ranzato

Motivation: Images as a composition of local parts Convolution example Credit: M. A. Ranzato

Motivation: images as a composition of local parts. Filtering example. Why translation equivariance? An input translation leads to a translation of the features, so fewer filters are needed: no translated replications, though orientation/frequency must still be covered. (Comparison: patch-based vs. convolutional.) Credit: R. Fergus

Convolutional Neural Networks

Multistage Hubel-Wiesel architecture: an old idea for local shift invariance [Hubel & Wiesel 1962]. Simple cells detect local features; complex cells pool the outputs of simple cells within a retinotopic neighborhood. Multiple stages of convolutions plus pooling & subsampling yield retinotopic feature maps: Convolutional Networks [LeCun 1988-present]. Credit: Y. LeCun

Convolutional Neural Networks: a neural network with a specialized connectivity structure. After a few convolution and subsampling stages, the spatial resolution is very small; fully connected layers are then used up to the classification at the output layer [LeCun et al. 1989].

Training large ConvNets on ImageNet: a paradigm shift enabled by a large annotated dataset, strong regularization (dropout), and GPU(s) for fast processing (~150 images/sec; days to weeks of training). ImageNet database: 1K classes, ~1K training images per class, ~1M training images [Russakovsky et al. ICCV 13].

ImageNet Classification 2012, top-5 error rate (%): Krizhevsky et al. (SuperVision) 16.4% error; next best (non-convnet) 26.2% error. (Bar chart comparing SuperVision, ISI, Oxford, INRIA, Amsterdam.) Credit: R. Fergus

Architecture of [Krizhevsky et al. NIPS 12]. ImageNet Classification 2012: 16.4% error (top-5); next best (a variant of SIFT + Fisher vectors) 26.2% error. Same idea as in [LeCun 98] but on a larger scale: more training images (10^6 vs 10^3) and more hidden layers. Overall: 60,000,000 parameters, trained on 2 GPUs for a week, with several tricks to reduce overfitting: data augmentation and DropOut (a new regularization).

ConvNet architecture. Feed-forward: convolve input → non-linearity (rectified linear) → pooling (local max). Supervised: train the convolutional filters by back-propagating the classification error. (Stack: input image → convolution (learned) → non-linearity → pooling → feature maps.) Credit: R. Fergus

Components of each layer: pixels / features → filter with a dictionary (convolutional or tiled) + non-linearity → spatial/feature pooling (sum or max) → [optional] normalization between feature responses → output features. Credit: R. Fergus
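
A hedged PyTorch sketch of one such layer, chaining the four components above; the channel counts and kernel sizes are illustrative, not from the slides.

```python
import torch
import torch.nn as nn

# One ConvNet stage: filter with a learned dictionary, non-linearity,
# spatial max pooling, and (optional) normalization between feature responses.
layer = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=64, kernel_size=5, padding=2),  # filtering
    nn.ReLU(),                              # non-linearity: max(0, x)
    nn.MaxPool2d(kernel_size=2, stride=2),  # spatial pooling (max)
    nn.LocalResponseNorm(size=5),           # optional normalization across feature maps
)

features = layer(torch.randn(1, 3, 224, 224))  # pixels in, feature maps out
print(features.shape)                          # torch.Size([1, 64, 112, 112])
```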

Convolutional filtering. Why convolution? Statistics of images look similar at different locations, and dependencies are very local. Filtering is an operation with translation equivariance. (Input → feature map.) Credit: R. Fergus

Non-linearity, applied to each response (per-feature, independent): tanh; sigmoid: 1/(1+exp(-x)); rectified linear: max(0, x). The rectified linear unit simplifies backprop, makes learning faster, and avoids saturation issues (much higher dynamic range); it is the preferred option. Credit: A. Vedaldi

Pooling. Spatial pooling over non-overlapping or overlapping regions (overlapping: -0.3% in error), using sum or max. Provides invariance to local transformations. (Max vs. sum.) Credit: R. Fergus

Normalization: contrast normalization. The same response is desired for different contrasts.

Normalization: contrast normalization (between/across feature maps) with local mean = 0 and local std. = 1, using a local 7x7 Gaussian. Equalizes the feature maps. (Feature maps shown before and after contrast normalization.) Credit: R. Fergus

Local feature normalisation Credit: A. Vedaldi

Recap: components of a CNN, a multi-layer NN architecture: convolutional + non-linear layer → sub-sampling layer → convolutional + non-linear layer → ... → fully connected layers. Supervised feature extraction followed by classification. Credit: B. Ginzburg

Summary: Components of Each Layer Credit: A. Vedaldi

Architecture of Krizhevsky et al.: 8 layers total, trained on the ImageNet dataset. Input image → Layer 1: Conv + Pool → Layer 2: Conv + Pool → Layer 3: Conv → Layer 4: Conv → Layer 5: Conv + Pool → Layer 6: Full → Layer 7: Full → Softmax output. Trained via backprop by maximizing the log-probability of the correct class label. Credit: R. Fergus
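
The layout above can be sketched in PyTorch roughly as follows. The filter sizes follow Krizhevsky et al., but treat this as an illustration rather than a faithful reimplementation: the original's two-GPU split and response normalization are omitted, and the input size is assumed to be 227x227.

```python
import torch.nn as nn

# Layers 1-5: convolutional (pooling after layers 1, 2 and 5); layers 6-8: fully connected.
features = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(), nn.MaxPool2d(3, 2),    # Layer 1: Conv + Pool
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(3, 2),  # Layer 2: Conv + Pool
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),                     # Layer 3: Conv
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),                     # Layer 4: Conv
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(3, 2), # Layer 5: Conv + Pool
)
classifier = nn.Sequential(
    nn.Flatten(),
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(), nn.Dropout(0.5),  # Layer 6: Full
    nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(0.5),         # Layer 7: Full
    nn.Linear(4096, 1000),                                     # Layer 8: softmax output (via cross-entropy loss)
)
# With a 227x227 input, `features` yields a 256x6x6 map, matching the classifier input.
```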

Motivation for improving generalization: DropOut. Random forests generalize well due to averaging over many models; decision trees are fast, but ConvNets are slow, so averaging many models is not feasible. DropOut is similar to bagging [Breiman 94], but differs in that parameters are shared. For fully connected layers only: in training, independently set each hidden unit's activity to zero with probability 0.5; in testing, multiply each neuron's output by 0.5 [Hinton et al. NIPS 12]. This corresponds to averaging over exponentially many samples from different model architectures.
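
A minimal sketch of the train/test rule just described, using plain PyTorch tensor operations (names are illustrative):

```python
import torch

def dropout(h, p=0.5, train=True):
    """DropOut as described above: zero each hidden unit with probability p
    at training time; at test time, scale outputs by (1 - p) instead."""
    if train:
        mask = (torch.rand_like(h) >= p).float()  # keep each unit with probability 1 - p
        return h * mask
    return h * (1 - p)                            # testing: multiply output by 0.5 when p = 0.5

h = torch.randn(4, 8)
print(dropout(h, train=True))   # roughly half the activations zeroed
print(dropout(h, train=False))  # all activations, scaled by 0.5
```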

Input representation: mean removal, i.e. centered (0-mean) RGB values (an input image (256x256) minus the mean input image) [Krizhevsky et al. NIPS 12]; further pre-processing tricks in [Chatfield et al. BMVC 14]. Data augmentation: train on 224x224 patches extracted randomly from the images, and also on their horizontal reflections.
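
A sketch of this preprocessing with torchvision transforms. Per-channel mean subtraction stands in for the paper's mean-image subtraction, and the mean values are illustrative placeholders:

```python
from torchvision import transforms

# Mean removal + the augmentation described above: random 224x224 crops
# from a 256x256 image plus horizontal reflections.
train_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(224),                 # random 224x224 patch
    transforms.RandomHorizontalFlip(p=0.5),     # horizontal reflection
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # illustrative channel means
                         std=[1.0, 1.0, 1.0]),        # std = 1: pure mean removal
])
```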

ImageNet Classification 2013/2014 results, test error (top-5): pre-2012: 26.2% error; 2012: 16.5% error; 2013: 11.2% error. http://www.image-net.org/challenges/lsvrc/2013/results.php Credit: R. Fergus

ImageNet Sample classifications [Krizhevsky et al. NIPS 12]

ImageNet sample classifications on validation images [Krizhevsky et al. NIPS 12]

ImageNet Classification Progress from 12-14 Credit: Y. LeCun

Credit: K. Simonyan

Better (deeper) architectures exist now. ILSVRC14 winners: ~7.3% top-5 error. VGG: 16 layers of stacked 3x3 convolutions with stride 1. Credit: K. Simonyan

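Why stacked 3x3 convolutions? Two stacked 3x3 convolutions (stride 1) cover the same 5x5 receptive field as a single 5x5 convolution, with fewer parameters and an extra non-linearity in between. A quick PyTorch check (channel count illustrative):

```python
import torch.nn as nn

c = 64  # illustrative channel count

# Same 5x5 receptive field, but 25c^2 vs. 18c^2 weights.
single_5x5 = nn.Conv2d(c, c, kernel_size=5, padding=2)
stacked_3x3 = nn.Sequential(
    nn.Conv2d(c, c, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(c, c, kernel_size=3, padding=1), nn.ReLU(),
)

params = lambda m: sum(p.numel() for p in m.parameters())
print(params(single_5x5), params(stacked_3x3))  # 102464 vs. 73856
```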

Better (deeper) architectures exist now. ILSVRC14 winners: ~6.6% top-5 error. GoogLeNet: a composition of multi-scale, dimension-reduced modules ("wire together what fires together"); 1x1 convolutions serve as dimensionality reduction. Compared against the Zeiler-Fergus architecture (1 tower). (Diagram legend: convolution, pooling, softmax, other.) Credit: C. Szegedy

In images, correlations tend to be local Credit: C. Szegedy

Cover very local clusters by 1x1 convolutions. Credit: C. Szegedy

Less spread-out correlations. Credit: C. Szegedy

Cover more spread-out clusters by 3x3 convolutions. Credit: C. Szegedy

Cover more spread-out clusters by 5x5 convolutions. Credit: C. Szegedy

A heterogeneous set of convolutions: 1x1, 3x3, 5x5. Credit: C. Szegedy

Schematic view (naive version): previous layer → 1x1 convolutions / 3x3 convolutions / 5x5 convolutions → filter concatenation. Credit: C. Szegedy

Naive idea: previous layer → 1x1 convolutions / 3x3 convolutions / 5x5 convolutions → filter concatenation. Credit: C. Szegedy

Naive idea (does not work!): too many parameters. Previous layer → 1x1 convolutions / 3x3 convolutions / 5x5 convolutions / 3x3 max pooling → filter concatenation. Credit: C. Szegedy

Inception module: previous layer → 1x1 convolutions / 1x1 convolutions → 3x3 convolutions / 1x1 convolutions → 5x5 convolutions / 3x3 max pooling → 1x1 convolutions → filter concatenation. The 1x1 convolution only operates across channels: it performs a linear projection of a single pixel's features, can be used to reduce dimensionality, and increases depth.
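
A hedged PyTorch sketch of such a module; the branch channel counts are illustrative, not GoogLeNet's actual values:

```python
import torch
import torch.nn as nn

class Inception(nn.Module):
    """Sketch of the inception module above; channel counts are illustrative."""
    def __init__(self, c_in):
        super().__init__()
        self.branch1 = nn.Conv2d(c_in, 64, 1)                      # 1x1 conv
        self.branch3 = nn.Sequential(nn.Conv2d(c_in, 32, 1),       # 1x1 dimensionality reduction
                                     nn.Conv2d(32, 64, 3, padding=1))
        self.branch5 = nn.Sequential(nn.Conv2d(c_in, 16, 1),       # 1x1 dimensionality reduction
                                     nn.Conv2d(16, 32, 5, padding=2))
        self.branch_pool = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                         nn.Conv2d(c_in, 32, 1))   # pool, then 1x1 projection

    def forward(self, x):
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)  # filter concatenation

y = Inception(192)(torch.randn(1, 192, 28, 28))
print(y.shape)  # torch.Size([1, 192, 28, 28])
```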

Inception: 9 inception modules; a network in a network in a network... (Diagram legend: convolution, pooling, softmax, other.) Credit: C. Szegedy

Classification failure cases. Ground truth: coffee mug. GoogLeNet: table lamp, lamp shade, printer, projector, desktop computer. Credit: C. Szegedy

Classification failure cases. Ground truth: hay. GoogLeNet: sorrel (horse), hartebeest, Arabian camel, warthog, gazelle. Credit: C. Szegedy

Classification failure cases. Ground truth: police car. GoogLeNet: laptop, hair drier, binocular, ATM machine, seat belt. Credit: C. Szegedy

Credit: K. Simonyan

Better (deeper) architectures exist now. ILSVRC15 winners: ~3.57% top-5 error. Credit: K. He

Faster R-CNN + ResNet examples. Credit: K. He

Better (deeper) architectures exist now. ILSVRC15 winners: an object detection revolution. Credit: K. He

Better (deeper) architectures exist now. Ultra deep: all 3x3 conv filters (except the first: 7x7); when the spatial size halves, the number of filters doubles. Uses batch normalisation during training; standard hyperparameters & data augmentation; no hidden fc layer; no max pooling (except after the first layer); no dropout. (Comparison figure with a 2014 architecture.) Credit: K. He

Simply stacking layers does not work. Credit: K. He

Solution: residual learning. Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun, Deep Residual Learning for Image Recognition, arXiv 2015. Credit: K. He
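
The core idea, as a minimal PyTorch sketch (layer sizes illustrative): instead of fitting a mapping y = H(x) directly, fit the residual F(x) = y - x and add the identity shortcut back.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Residual learning: y = F(x) + x, with an identity shortcut connection."""
    def __init__(self, c):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c), nn.ReLU(),
            nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.f(x) + x)  # shortcut: the block only learns the residual F(x)

y = ResidualBlock(64)(torch.randn(1, 64, 56, 56))
print(y.shape)  # torch.Size([1, 64, 56, 56])
```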

Breaking CNNs Intriguing properties of neural networks [Szegedy ICLR 2014]

What is going on? Adversarial examples: modify the image x along the gradient of the error E with respect to x, so as to increase classifier error. Explaining and Harnessing Adversarial Examples [Goodfellow ICLR 2015]
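
For instance, the fast gradient sign method from the cited paper perturbs the image along the sign of the input gradient. A short sketch, where the model, loss and ε are placeholders:

```python
import torch

def fgsm(model, loss_fn, x, y, eps=0.01):
    """Fast gradient sign method [Goodfellow et al., ICLR 2015]: perturb the
    image in the direction that increases the classifier's error."""
    x = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x), y)
    loss.backward()                             # gradient of the error w.r.t. the image
    return (x + eps * x.grad.sign()).detach()   # adversarial example, visually near-identical
```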

What is learned? Visualizing CNNs M. Zeiler & R. Fergus, Visualizing and Understanding Convolutional Networks, ECCV, 2014 A. Mahendran and A. Vedaldi. Understanding Deep Image Representations by Inverting Them, CVPR 2015

Visualization using Deconvolutional Networks: provides a way to map activations at high layers back to the input [Zeiler et al. CVPR 10, ICCV 11, ECCV 14]. Same operations as a convnet, but in reverse: unpool feature maps, then convolve the unpooled maps. Filters are copied from the convnet; the deconvnet is used here purely as a probe. Originally proposed as an unsupervised learning method; here there is no inference and no learning. (Stack: feature maps → unpooling → non-linearity → convolution (learned) → input image.) Credit: R. Fergus

Unpooling operation: switches record where each pooled activation came from, so the reconstruction from the layer above can be unpooled back to the correct locations. Credit: R. Fergus
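
In PyTorch terms, the switches are the indices returned by max pooling; a minimal sketch:

```python
import torch
import torch.nn as nn

# "Switches" = the indices of the max locations recorded during pooling;
# unpooling places each activation back where it originally came from.
pool = nn.MaxPool2d(2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(2, stride=2)

x = torch.randn(1, 1, 4, 4)
pooled, switches = pool(x)                # forward: pool and record the switches
reconstructed = unpool(pooled, switches)  # reverse: unpool using the switches
print(reconstructed.shape)                # torch.Size([1, 1, 4, 4]); zeros except at max locations
```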

Deconvnet projection from higher layers [Zeiler and Fergus, ECCV 14]: zero out all activations in a layer's feature maps except the one of interest, then reconstruct down through the deconvnet filters (layer 2 reconstruction → layer 1 reconstruction → input image), in parallel with the convnet visualization path. Credit: R. Fergus

Visualizations of higher layers: use the ImageNet 2012 validation set (a stack of images). Push each image through the network and look for the image with the strongest activation for each feature map; take the max activation from the feature map associated with each filter. Then use the deconvnet to project back to pixel space, using the pooling switches specific to that activation. Credit: R. Fergus

Layer 1 Filters Credit: R. Fergus

Layer 1: Top-9 Patches Credit: R. Fergus

Layer 2: Top-1 Credit: R. Fergus

Layer 2: Top-9 Parts of input image that give strong activation of this feature map

Layer 2: Top-9 Patches Patches from validation images that give maximal activation of a given feature map

Layer 3: Top-1

Layer 3: Top-9

Layer 3: Top-9 Patches

Layer 4: Top-1

Layer 4: Top-9

Layer 4: Top-9 patches

Layer 5: Top-1

Layer 5: Top-9

Layer 5: Top-9 Patches

Evolution of Features During Training

Evolution of features during training: higher layers evolve later, only in later epochs of training. Credit: R. Fergus

Visualizations can be used to improve the model. Visualization of Krizhevsky et al.'s architecture showed some problems with layers 1 and 2: blocking artifacts in the layer 2 filters. Alter the architecture with a smaller stride and filter size; with a smaller stride for the convolution, the visualizations look better and performance improves. Credit: R. Fergus

Visualizations can be used to improve the model. The layer 1 filters were too specific for low-level features, with dead filters; restrict the filter size (smaller): 11x11 filters with stride 4 → 7x7 filters with stride 2. The visualizations look better and performance improves. Credit: R. Fergus

Occlusion experiment: mask parts of the input with an occluding square and monitor the output of the classification network. Is the network perhaps using scene context? Credit: R. Fergus
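
A sketch of the experiment; the model, square size, and gray fill value are illustrative placeholders:

```python
import torch

def occlusion_map(model, image, true_class, size=32, stride=16):
    """Slide a gray occluding square over the image and record p(true class)
    at each position; low values mark regions the classifier relies on.
    `model` maps a (1, C, H, W) tensor to class logits."""
    _, _, H, W = image.shape
    heatmap = []
    for top in range(0, H - size + 1, stride):
        row = []
        for left in range(0, W - size + 1, stride):
            occluded = image.clone()
            occluded[:, :, top:top + size, left:left + size] = 0.5  # gray square
            with torch.no_grad():
                p = model(occluded).softmax(dim=1)[0, true_class].item()
            row.append(p)
        heatmap.append(row)
    return torch.tensor(heatmap)
```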

(Three example figures: input image, map of p(true class) as the occluder moves, and the most probable class per occluder position.) Credit: R. Fergus

Finding a Pre-Image Credit: A. Vedaldi

Images are locally smooth Credit: A. Vedaldi

Finding a pre-image at intermediate layers Credit: A. Vedaldi
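
A minimal sketch of pre-image search in the spirit of Mahendran & Vedaldi: optimize an image so that its representation matches target features, with a total-variation term encoding the local-smoothness prior above. Here `phi` (the representation at some layer) and all hyperparameters are placeholders.

```python
import torch

def find_preimage(phi, target_feats, shape, steps=200, lr=0.1, tv_weight=1e-4):
    """Find a pre-image x whose representation phi(x) matches target features,
    regularized by total variation (after Mahendran & Vedaldi, CVPR 2015)."""
    x = torch.zeros(shape, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        feat_loss = (phi(x) - target_feats).pow(2).sum()        # match the representation
        tv = (x[..., 1:, :] - x[..., :-1, :]).abs().sum() + \
             (x[..., :, 1:] - x[..., :, :-1]).abs().sum()       # images are locally smooth
        (feat_loss + tv_weight * tv).backward()
        opt.step()
    return x.detach()
```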