CS 1699: Intro to Computer Vision. Deep Learning. Prof. Adriana Kovashka University of Pittsburgh December 1, 2015

Today: Deep neural networks. Background; architectures and basic operations; applications; visualizing deep neural networks; breaking deep learning; packages.

Deep neural network. Figure from http://neuralnetworksanddeeplearning.com/chap5.html

Quick summary. Take a model trained on, e.g., the ImageNet 2012 training set. Take the outputs of the 6th or 7th layer. Optional: fine-tune features and/or classifier on the new dataset (to train a model from scratch, you need lots of data). Classify the test set of the new dataset. Why now: we have lots of data, and deep nets can be trained in reasonable time with GPUs. Language: deep learning ~ deep neural nets ~ convolutional neural nets. Adapted from Lana Lazebnik

Traditional Recognition Approach. Image/video pixels -> hand-designed feature extraction (e.g. SIFT, HOG) -> trainable classifier -> object class. Features are key to recent progress in recognition, but flawed. Where next? Better classifiers? Or keep building more features? Adapted from Lana Lazebnik

What about learning the features? Learn a feature hierarchy all the way from pixels to classifier. Each layer extracts features from the output of the previous layer. Train all layers jointly. Image/video pixels -> Layer 1 -> Layer 2 -> Layer 3 -> simple classifier. Lana Lazebnik

Shallow vs. deep architectures. Traditional recognition (shallow architecture): image/video pixels -> hand-designed feature extraction -> trainable classifier -> object class. Deep learning (deep architecture): image/video pixels -> Layer 1 -> ... -> Layer N -> simple classifier -> object class. Lana Lazebnik

Background: Perceptrons. Inputs $x_1, x_2, x_3, \dots, x_d$ with weights $w_1, w_2, w_3, \dots, w_d$. Output: $\sigma(w \cdot x + b)$. Sigmoid function: $\sigma(t) = \frac{1}{1 + e^{-t}}$. Lana Lazebnik
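
The perceptron on this slide can be written out in a few lines. This is a minimal sketch in plain Python; the function names and the example inputs, weights, and bias are illustrative assumptions, not values from the lecture.

```python
import math

def sigmoid(t):
    """Logistic sigmoid: 1 / (1 + e^(-t))."""
    return 1.0 / (1.0 + math.exp(-t))

def perceptron(x, w, b):
    """Return sigmoid(w . x + b) for input list x, weight list w, bias b."""
    activation = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(activation)

# Hypothetical example values (not from the slides).
x = [1.0, 2.0, 3.0]
w = [0.5, -0.25, 0.0]
b = 0.0

# w . x + b = 0.5 - 0.5 + 0.0 = 0, and sigmoid(0) = 0.5
output = perceptron(x, w, b)
```

An output above 0.5 corresponds to a positive activation w . x + b, which is why a single unit of this kind acts as a linear classifier.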

Background: Multi-Layer Neural Networks. Nonlinear classifier. Training: find network weights $w$ to minimize the error between true training labels $y_i$ and estimated labels $f_w(x_i)$: $E(w) = \sum_{i=1}^{N} \left( y_i - f_w(x_i) \right)^2$. Minimization can be done by gradient descent provided $f$ is differentiable. This training method is called back-propagation. Lana Lazebnik
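
To make the training objective concrete, here is a hedged sketch of gradient descent on the squared error E(w) for the simplest possible case: a single sigmoid unit rather than a multi-layer network. The toy dataset (logical OR), the learning rate, and the number of steps are assumptions chosen for illustration; back-propagation applies the same chain rule layer by layer in a deeper network.

```python
import math

def f(w, b, x):
    """Single sigmoid unit: f_w(x) = sigmoid(w . x + b)."""
    t = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-t))

def error(w, b, data):
    """E(w) = sum_i (y_i - f_w(x_i))^2."""
    return sum((y - f(w, b, x)) ** 2 for x, y in data)

def gradient_step(w, b, data, lr):
    """One step of batch gradient descent on E(w)."""
    grad_w = [0.0] * len(w)
    grad_b = 0.0
    for x, y in data:
        out = f(w, b, x)
        # Chain rule: dE/dt = -2 (y - out) * out * (1 - out)
        delta = -2.0 * (y - out) * out * (1.0 - out)
        for j, xj in enumerate(x):
            grad_w[j] += delta * xj
        grad_b += delta
    new_w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
    return new_w, b - lr * grad_b

# Hypothetical toy data: the logical OR function.
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
        ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]

w, b = [0.0, 0.0], 0.0
initial_error = error(w, b, data)
for _ in range(2000):
    w, b = gradient_step(w, b, data, lr=0.5)
final_error = error(w, b, data)
predictions = [round(f(w, b, x)) for x, _ in data]
```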

Inspiration: Neuron cells. Lana Lazebnik

Biological neuron and Perceptrons. A biological neuron. An artificial neuron (Perceptron): a linear classifier. Jia-bin Huang

Hubel/Wiesel Architecture. D. Hubel and T. Wiesel (1959, 1962, Nobel Prize 1981). Visual cortex consists of a hierarchy of simple, complex, and hyper-complex cells. Hierarchy of feature detectors in the visual cortex, with higher level features responding to patterns of activation in lower level cells, and propagating activation upwards to still higher level cells. Lana Lazebnik, Jia-bin Huang

Hubel/Wiesel Architecture and Multi-layer Neural Network. Hubel and Wiesel's architecture. Multi-layer Neural Network: a non-linear classifier. Jia-bin Huang

Convolutional Neural Networks (CNN). Neural network with specialized connectivity structure. Stack multiple stages of feature extractors. Higher stages compute more global, more invariant, more abstract features. Classification layer at the end. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86(11): 2278-2324, 1998. Adapted from Rob Fergus

Convolutional Neural Networks (CNN). Feed-forward feature extraction: 1. Convolve input with learned filters. 2. Apply non-linearity. 3. Spatial pooling (downsample). 4. Normalization (optional). Supervised training of convolutional filters by back-propagating classification error. Figure (bottom to top): input image -> convolution (learned) -> non-linearity -> spatial pooling -> normalization -> feature maps. Adapted from Lana Lazebnik

1. Convolution. Apply learned filter weights. One feature map per filter. Stride can be greater than 1 (faster, less memory). Figure: input -> feature map. Adapted from Rob Fergus
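
The convolution step, including the effect of stride, can be sketched as follows. This is an illustrative plain-Python version for a single-channel input with no padding; like most CNN libraries it computes cross-correlation (the filter is not flipped). The image and filter values are made up for the example.

```python
def conv2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (both lists of rows) and return one feature map.

    No padding is used, so the output has
    (H - kh) // stride + 1 rows and (W - kw) // stride + 1 columns.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0.0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i * stride + di][j * stride + dj] * kernel[di][dj]
            row.append(total)
        feature_map.append(row)
    return feature_map

# Hypothetical 4x4 input and two 2x2 filters (in a CNN the filters are learned).
image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
filters = [
    [[1, 0], [0, -1]],   # responds to diagonal intensity change
    [[1, 1], [1, 1]],    # local sum
]

# One feature map per filter.
maps_stride1 = [conv2d(image, k, stride=1) for k in filters]   # each 3x3
maps_stride2 = [conv2d(image, k, stride=2) for k in filters]   # each 2x2
```

With stride 2 each map has 4 entries instead of 9, which is the "faster, less memory" trade-off mentioned on the slide.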

2. Non-Linearity. Per-element (independent). Options: tanh; sigmoid: 1/(1+exp(-x)); rectified linear unit (ReLU): simplifies backpropagation, makes learning faster, avoids saturation issues; preferred option (works well). Rob Fergus
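
The three non-linearity options are each a one-line function applied independently to every element of a feature map. A minimal sketch, with hypothetical helper names and example values:

```python
import math

def tanh(x):
    return math.tanh(x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    """Rectified linear unit: max(0, x)."""
    return max(0.0, x)

def apply_elementwise(fn, feature_map):
    """Apply `fn` independently to every element of a 2D feature map."""
    return [[fn(v) for v in row] for row in feature_map]

# Hypothetical feature map values.
feature_map = [[-2.0, 0.0], [0.5, 3.0]]
rectified = apply_elementwise(relu, feature_map)   # negatives become 0
squashed = apply_elementwise(sigmoid, feature_map) # values in (0, 1)
```

Note that sigmoid(3.0) is already about 0.95, so large inputs barely change the output (saturation), while ReLU passes positive values through unchanged.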

3. Spatial Pooling. Sum or max. Non-overlapping / overlapping regions. Role of pooling: invariance to small transformations; larger receptive fields (see more of input). Rob Fergus, figure from Andrej Karpathy
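
Sum and max pooling over non-overlapping or overlapping regions differ only in the reduction function and the stride. The sketch below is an illustrative plain-Python version; the window size, stride, and input values are assumptions for the example.

```python
def pool2d(feature_map, size=2, stride=2, op=max):
    """Pool `size` x `size` windows of a 2D feature map using `op` (max or sum).

    stride == size gives non-overlapping regions; stride < size gives overlap.
    """
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1
    return [
        [
            op(feature_map[i * stride + di][j * stride + dj]
               for di in range(size) for dj in range(size))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

# Hypothetical 4x4 feature map.
feature_map = [[1, 2, 3, 4],
               [5, 6, 7, 8],
               [9, 10, 11, 12],
               [13, 14, 15, 16]]

max_pooled = pool2d(feature_map, size=2, stride=2, op=max)   # non-overlapping
sum_pooled = pool2d(feature_map, size=2, stride=2, op=sum)
overlapping = pool2d(feature_map, size=2, stride=1, op=max)  # 3x3 output
```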

4. Normalization. Within or across feature maps. Before or after spatial pooling. Figure: feature maps -> feature maps after contrast normalization. Rob Fergus
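
The slide does not say which normalization formula is used, so the following is only one simple possibility: normalizing within a single feature map by subtracting its mean and dividing by its standard deviation. The function name and the `eps` stabilizer are assumptions; schemes that normalize across feature maps or over local neighbourhoods follow the same idea with a different set of values in the mean and variance.

```python
import math

def contrast_normalize(feature_map, eps=1e-8):
    """Shift a 2D feature map to zero mean and scale it to (roughly) unit variance.

    `eps` is an assumed small constant that avoids division by zero
    when the map is constant.
    """
    values = [v for row in feature_map for v in row]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance)
    return [[(v - mean) / (std + eps) for v in row] for row in feature_map]

# Hypothetical feature map.
feature_map = [[1.0, 2.0], [3.0, 4.0]]
normalized = contrast_normalize(feature_map)

flat = [v for row in normalized for v in row]
normalized_mean = sum(flat) / len(flat)
normalized_var = sum((v - normalized_mean) ** 2 for v in flat) / len(flat)
```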

Compare: SIFT Descriptor. Image pixels -> apply oriented filters -> spatial pool (sum) -> normalize to unit length -> feature vector. Lowe [IJCV 2004]. Rob Fergus

A common CNN architecture (AlexNet). Figure from http://www.mdpi.com/2072-4292/7/11/14680/htm

Training Convolutional Neural Networks. Backpropagation + stochastic gradient descent. Neural Networks: Tricks of the Trade. Initialization; transfer learning; dropout; data augmentation. Adapted from Jia-bin Huang

Dropout. Randomly turn off some neurons. Allows individual neurons to independently be responsible for performance. Dropout: A simple way to prevent neural networks from overfitting [Srivastava JMLR 2014]. Adapted from Jia-bin Huang
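
"Randomly turn off some neurons" can be sketched as a mask applied to a layer's activations during training. The version below uses the "inverted dropout" convention (surviving activations are scaled up at training time so nothing changes at test time), which is a common implementation choice and an assumption here rather than something stated on the slide. The drop probability and the seed are likewise illustrative.

```python
import random

def dropout(activations, drop_prob, training=True, rng=random):
    """Zero each activation with probability `drop_prob` during training.

    Kept activations are divided by the keep probability (inverted dropout)
    so that the expected value of each unit is unchanged.
    At test time the activations pass through untouched.
    """
    if not training or drop_prob == 0.0:
        return list(activations)
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

# Hypothetical layer output: 1000 units that all fire with value 1.0.
rng = random.Random(0)            # seeded only to make the example repeatable
activations = [1.0] * 1000
train_out = dropout(activations, drop_prob=0.5, rng=rng)
test_out = dropout(activations, drop_prob=0.5, training=False)
num_dropped = train_out.count(0.0)
```

Because a different random subset is switched off for every training example, no unit can rely on a specific other unit being present.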

Data Augmentation (Jittering). Create virtual training samples: horizontal flip, random crop, color casting, geometric distortion. Deep Image [Wu et al. 2015]. Jia-bin Huang
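
Two of the listed jittering operations, horizontal flip and random crop, are simple enough to sketch on an image stored as a list of pixel rows. The function names, crop size, and tiny example image are assumptions for illustration; color casting and geometric distortion are not covered here.

```python
import random

def horizontal_flip(image):
    """Mirror an image (list of rows) left to right."""
    return [row[::-1] for row in image]

def random_crop(image, crop_h, crop_w, rng=random):
    """Cut a crop_h x crop_w window at a random position inside the image."""
    top = rng.randint(0, len(image) - crop_h)       # randint is inclusive
    left = rng.randint(0, len(image[0]) - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]

# Hypothetical 3x4 grayscale image.
image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12]]

rng = random.Random(0)            # seeded only to make the example repeatable
flipped = horizontal_flip(image)
crop = random_crop(image, crop_h=2, crop_w=2, rng=rng)

# Each original image can yield several "virtual" training samples.
virtual_samples = [image, flipped, crop, horizontal_flip(crop)]
```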

Convnet Successes. Handwritten text/digits: MNIST (0.17% error [Ciresan et al. 2011]); Arabic & Chinese [Ciresan et al. 2012]. Simpler recognition benchmarks: CIFAR-10 (9.3% error [Wan et al. 2013]); traffic sign recognition, 0.56% error vs 1.16% for humans [Ciresan et al. 2011]. But until recently, less good at more complex datasets: Caltech-101/256 (few training examples). Lana Lazebnik

ImageNet Challenge 2012. [Figure: validation classification examples.] ~14 million labeled images, 20k classes. Images gathered from Internet. Human labels via Amazon Turk [Deng et al. CVPR 2009]. Challenge: 1.2 million training images, 1000 classes. A. Krizhevsky, I. Sutskever, and G. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012. Lana Lazebnik

ImageNet Challenge 2012. AlexNet: similar framework to LeCun '98 but: bigger model (7 hidden layers, 650,000 units, 60,000,000 params); more data (10^6 vs. 10^3 images); GPU implementation (50x speedup over CPU), trained on two GPUs for a week; better regularization for training (DropOut). A. Krizhevsky, I. Sutskever, and G. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012. Lana Lazebnik

ImageNet Challenge 2012. Krizhevsky et al.: 16.4% error (top-5). Next best (non-convnet): 26.2% error. [Bar chart: top-5 error rate (%) for SuperVision, ISI, Oxford, INRIA, Amsterdam.] Lana Lazebnik

ImageNet Challenge 2012-2014. Best non-convnet in 2012: 26.2%
Team                            Year  Place  Error (top-5)
SuperVision Toronto (7 layers)  2012  -      16.4%
Clarifai NYU (7 layers)         2013  -      11.7%
VGG Oxford (16 layers)          2014  2nd    7.32%
GoogLeNet (19 layers)           2014  1st    6.67%
Human expert*                   -     -      5.1%
Jia-bin Huang
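
The results above are reported as top-5 error: a prediction counts as correct if the true class is among the five highest-scoring classes. A minimal sketch of that metric follows; the function name and the two made-up score vectors (over six classes instead of 1000) are assumptions for illustration.

```python
def top_k_error(scores_per_image, true_labels, k=5):
    """Fraction of images whose true label is NOT among the k highest scores."""
    errors = 0
    for scores, label in zip(scores_per_image, true_labels):
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if label not in ranked[:k]:
            errors += 1
    return errors / len(true_labels)

# Hypothetical classifier scores for two images over six classes.
scores_per_image = [
    [0.30, 0.25, 0.20, 0.15, 0.07, 0.03],   # true class 5 is ranked last
    [0.30, 0.25, 0.20, 0.15, 0.07, 0.03],   # true class 4 is ranked fifth
]
true_labels = [5, 4]

top5 = top_k_error(scores_per_image, true_labels, k=5)   # one miss out of two
top1 = top_k_error(scores_per_image, true_labels, k=1)   # both are misses
```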

Results on misc. benchmarks. [1] Caltech-101 (30 samples per class); [1] SUN 397 dataset (DeCAF); [1] Caltech-UCSD Birds (DeCAF); [2] MIT-67 Indoor Scenes dataset (OverFeat). [1] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition, arXiv preprint, 2014. [2] A. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, CNN Features off-the-shelf: an Astounding Baseline for Recognition, arXiv preprint, 2014. Lana Lazebnik

CNN features for detection. Object detection system overview. Our system (1) takes an input image, (2) extracts around 2000 bottom-up region proposals, (3) computes features for each proposal using a large convolutional neural network (CNN), and then (4) classifies each region using class-specific linear SVMs. R-CNN achieves a mean average precision (mAP) of 53.7% on PASCAL VOC 2010. For comparison, Uijlings et al. (2013) report 35.1% mAP using the same region proposals, but with a spatial pyramid and bag-of-visual-words approach. The popular deformable part models perform at 33.4%. R. Girshick, J. Donahue, T. Darrell, and J. Malik, Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, CVPR 2014. Lana Lazebnik

Transfer Learning. Improvement of learning in a new task through the transfer of knowledge from a related task that has already been learned. Learning and Transferring Mid-Level Image Representations using Convolutional Neural Networks [Oquab et al. CVPR 2014]. Jia-bin Huang

Beyond classification: detection, segmentation, regression, pose estimation, matching patches, synthesis, and many more. Jia-bin Huang

Photographer identification. Who took this photograph? Deep net features achieve 74% accuracy. Chance is less than 3%; human performance is 47%. Method learns what proto-objects + scenes authors shoot. Can use this to develop field guides for human use. Can generate novel photographs by a given author. Thomas and Kovashka, submitted to CVPR 2016

Photographer identification. Some misclassifications. Thomas and Kovashka, submitted to CVPR 2016

Layer 1. Visualizing and Understanding Convolutional Networks [Zeiler and Fergus, ECCV 2014]

Layer 2. Patches from validation images that give maximal activation of a given feature map. Visualizing and Understanding Convolutional Networks [Zeiler and Fergus, ECCV 2014]

Layer 3. Visualizing and Understanding Convolutional Networks [Zeiler and Fergus, ECCV 2014]

Layer 3: Top-9 Patches

Layer 4 and 5. Visualizing and Understanding Convolutional Networks [Zeiler and Fergus, ECCV 2014]

Evolution of Features During Training. Lana Lazebnik

Diagnosing Problems. Visualization of Krizhevsky et al.'s architecture showed some problems with layers 1 and 2: large stride of 4 used. Alter architecture: smaller stride & filter size. Visualizations look better. Performance improves. Figure: Krizhevsky et al. vs. Zeiler and Fergus. Lana Lazebnik

How important is depth? Architecture of Krizhevsky et al.: 8 layers total, trained on ImageNet, 18.1% top-5 error. Input Image -> Layer 1: Conv + Pool -> Layer 2: Conv + Pool -> Layer 3: Conv -> Layer 4: Conv -> Layer 5: Conv + Pool -> Layer 6: Full -> Layer 7: Full -> Softmax Output. Lana Lazebnik

How important is depth? Remove top fully connected layer (Layer 7): drop 16 million parameters. Only 1.1% drop in performance! Input Image -> Layer 1: Conv + Pool -> Layer 2: Conv + Pool -> Layer 3: Conv -> Layer 4: Conv -> Layer 5: Conv + Pool -> Layer 6: Full -> Softmax Output. Lana Lazebnik

How important is depth? Remove both fully connected layers (Layers 6 & 7): drop ~50 million parameters. 5.7% drop in performance. Input Image -> Layer 1: Conv + Pool -> Layer 2: Conv + Pool -> Layer 3: Conv -> Layer 4: Conv -> Layer 5: Conv + Pool -> Softmax Output. Lana Lazebnik

How important is depth? Now try removing upper feature extractor layers (Layers 3 & 4): drop ~1 million parameters. 3.0% drop in performance. Input Image -> Layer 1: Conv + Pool -> Layer 2: Conv + Pool -> Layer 5: Conv + Pool -> Layer 6: Full -> Layer 7: Full -> Softmax Output. Lana Lazebnik

How important is depth? Now try removing upper feature extractor layers & fully connected layers (Layers 3, 4, 6, 7): now only 4 layers. 33.5% drop in performance. Depth of network is key. Input Image -> Layer 1: Conv + Pool -> Layer 2: Conv + Pool -> Layer 5: Conv + Pool -> Softmax Output. Lana Lazebnik

Tapping off Features at each Layer. Plug features from each layer into linear SVM. Features are neuron activations at that level. Adapted from Matt Zeiler

Breaking CNNs. Intriguing properties of neural networks [Szegedy ICLR 2014]. Jia-bin Huang

Breaking CNNs. Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images [Nguyen et al. CVPR 2015]. Jia-bin Huang

Fooling a linear classifier. To fool a linear classifier, add a small multiple of the weight vector to the training example: $x \leftarrow x + \alpha w$. Jia-bin Huang. http://karpathy.github.io/2015/03/30/breaking-convnets/
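
Why this works: for a linear score w . x + b, replacing x with x + alpha * w raises the score by exactly alpha * ||w||^2, so even a small alpha can flip the decision when w has many dimensions. The sketch below demonstrates this with made-up weights, input, and alpha; the function names are illustrative.

```python
def score(w, b, x):
    """Linear classifier score w . x + b (positive means class 1)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def perturb(x, w, alpha):
    """Return x + alpha * w."""
    return [xi + alpha * wi for xi, wi in zip(x, w)]

# Hypothetical classifier and input (not from the slides).
w = [1.0, -2.0, 0.5, 1.5]
b = 0.0
x = [0.2, 0.4, 0.2, 0.1]
alpha = 0.1

x_fooled = perturb(x, w, alpha)
original_score = score(w, b, x)          # negative: classified as class 0
fooled_score = score(w, b, x_fooled)     # positive: classified as class 1

# The score changes by alpha * ||w||^2.
squared_norm = sum(wi * wi for wi in w)
```

Each coordinate of the input moves by at most 0.2 here, yet the predicted class changes.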

Packages: Caffe (new version of Decaf), cuda-convnet2, Torch, Theano, MatConvNet, Pylearn2, Overfeat.

Tutorials: http://deeplearning.net/ and http://cs231n.stanford.edu

Things to remember. Overview: neuroscience, perceptron, multi-layer neural networks. Convolutional neural network (CNN): convolution, nonlinearity, max pooling. CNN for classification and beyond. Understanding and visualizing CNN: find images that maximize some class scores; visualize individual neuron activation and input patterns; breaking CNNs. Training CNN: dropout; data augmentation; transfer learning. Using CNNs for your own task: as a basic first step, try the pre-trained CaffeNet fc6-fc8 layers as features. Adapted from Jia-bin Huang