Regularization of Neural Networks using DropConnect


Regularization of Neural Networks using DropConnect
Li Wan, Matthew Zeiler, Sixin Zhang, Yann LeCun, Rob Fergus
Dept. of Computer Science, Courant Institute of Mathematical Sciences, New York University
June 17, 2013

Introduction
Neural networks are good at classifying large labeled datasets.
Large capacity is essential: more layers and more units.
But without regularization, a model with millions or billions of parameters can easily overfit.
Existing regularization methods: l1 or l2 penalties, Bayesian methods, early stopping of training, and the DropOut network [Hinton et al. 2012].
We introduce a new form of regularization: DropConnect.

Overview
1 What is a DropConnect Network
2 Theoretical Results about DropConnect Networks
3 Experiments
4 Implementation Details
5 Conclusion

Review of the DropOut Network [Hinton et al. 2012]
Stochastic dropping of units: each element of a layer's output is kept with probability p and set to 0 with probability (1 - p).
With input v, weights W, activation function a(.), DropOut mask m, and output r:
    r = m ⊙ a(Wv)        (⊙ denotes element-wise product)
Every training example at every epoch gets a different mask m.
[Figure: a normal network vs. a DropOut network, in which the mask m zeroes out some of the outputs r_i]

DropConnect Network
Applies only to fully-connected layers.
Randomly drops connections in the network, each with probability 1 - p.
Generalization of DropOut:
    r = a((M ⊙ W)v)      (valid for a(0) = 0, e.g. relu)
[Figure: a DropOut network vs. a DropConnect network, in which the mask M is applied to the weights rather than to the outputs]
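To make the two masking schemes concrete, here is a minimal NumPy sketch (not the authors' code); the layer sizes, the relu activation and the Bernoulli mask sampling are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                  # keep probability
W = rng.standard_normal((3, 4))          # example weights: 3 outputs, 4 inputs
v = rng.standard_normal(4)               # example input vector
relu = lambda u: np.maximum(u, 0)

# DropOut: mask the layer OUTPUTS, one Bernoulli(p) draw per output unit
m = rng.binomial(1, p, size=3)
r_dropout = m * relu(W @ v)

# DropConnect: mask the WEIGHTS, one Bernoulli(p) draw per connection
M = rng.binomial(1, p, size=W.shape)
r_dropconnect = relu((M * W) @ v)        # equals a((M ⊙ W)v); relies on relu(0) = 0
```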

DropConnect Network: Training and Inference
Training
Every training example at every epoch gets a different binary mask matrix M.
The backward pass uses the same M as the forward pass, for each example.
Train with mini-batch SGD; an efficient implementation requires care.
Inference
The exact solution is intractable; approximate each neuron's activation by a Gaussian distribution.
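A minimal sketch of one DropConnect training step on a single example, reusing the toy layer above; the squared-error loss and learning rate are assumptions for illustration, but it shows the point of this slide: the backward pass reuses the mask M sampled in the forward pass.

```python
import numpy as np

rng = np.random.default_rng(1)
p, lr = 0.5, 0.1
W = rng.standard_normal((3, 4))
v = rng.standard_normal(4)
target = rng.standard_normal(3)

# Forward pass: sample a fresh binary mask M for this example
M = rng.binomial(1, p, size=W.shape)
u = (M * W) @ v                   # pre-activation of the DropConnect layer
r = np.maximum(u, 0)              # relu activation

# Backward pass: the SAME mask M gates the gradient
dr = r - target                   # gradient of 0.5 * ||r - target||^2 w.r.t. r
du = dr * (u > 0)                 # relu derivative
dW = M * np.outer(du, v)          # gradient flows only through kept connections
W -= lr * dW                      # SGD update
```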

DropConnect Network: Inference
Exact inference requires a sum over all 2^|M| possible masks M:
    r = E_M[a((M ⊙ W)v)] = Σ_M p(M) a((M ⊙ W)v)
Inference approximation
A single neuron before the activation function: u_i = Σ_j (W_ij v_j) M_ij.
Approximate u_i by a Gaussian distribution via moment matching:
    u ~ N(pWv, p(1 - p)(W ⊙ W)(v ⊙ v))
where 1 - p is the DropConnect rate.
Inference algorithm
Compute the layer response at test time as r ≈ E_u[a(u)], by sampling or numerical integration.
Each neuron's activation is sampled independently, so inference is very efficient.
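A minimal sketch of the sampling-based inference approximation, again on the toy layer; the sample count of 1000 is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
p = 0.5
W = rng.standard_normal((3, 4))
v = rng.standard_normal(4)
relu = lambda u: np.maximum(u, 0)

# Moment-matched Gaussian for each neuron's pre-activation u
mean = p * (W @ v)
var = p * (1 - p) * ((W * W) @ (v * v))

# r ~= E_u[a(u)], estimated by sampling each neuron independently
u_samples = rng.normal(mean, np.sqrt(var), size=(1000, 3))
r_test = relu(u_samples).mean(axis=0)
```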

Inference Comparison: DropOut vs. DropConnect
DropOut network inference [Hinton et al. 2012]
Approximate by swapping the order of the expectation and the neuron activation:
    E_M[a((M ⊙ W)v)] ≈ a(E_M[(M ⊙ W)v]) = a(pWv)
Failure example
For u ~ N(0, 1) with a(u) = max(u, 0): a(E[u]) = 0, while E_u[a(u)] = 1/√(2π) ≈ 0.4.
DropConnect network inference
Approximate by Gaussian moment matching:
    E_M[a((M ⊙ W)v)] ≈ E_u[a(u)]
which gives the right answer for the example above.
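A quick numerical check of the failure example (a sketch; the sample count is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
u = rng.standard_normal(1_000_000)   # u ~ N(0, 1)

print(np.maximum(u, 0).mean())       # E_u[a(u)] ~= 0.3989 = 1/sqrt(2*pi)
print(max(u.mean(), 0.0))            # a(E[u])   ~= 0
```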

Why DropConnect Regularizes the Network
DropConnect network model
Model-averaging interpretation: a mixture of 2^|M| classifiers,
    f(x; θ) = E_M[f_M(x; θ, M)] = Σ_M p(M) f_M(x; θ, M)
Rademacher complexity of the model
Let W_s be the weights of the soft-max layer and W the weights of the DropConnect layer, with max|W_s| ≤ B_s and max|W| ≤ B_h. Let k be the number of classes, R̂_ℓ(G) the Rademacher complexity of the feature extractor (the layers before the DropConnect layer), and n and d the input and output dimensionality of the DropConnect layer, respectively. Then
    R̂_ℓ(F) ≤ p (2√k d B_s n d B_h) R̂_ℓ(G)
Special cases of p
1 p = 0: the model complexity is zero, since the input has no influence on the output.
2 p = 1: it returns to the complexity of a standard model.
3 p = 1/2: all sub-models have equal preference.
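To make the mixture of 2^|M| classifiers concrete, here is a toy sketch (with assumed sizes, not taken from the paper) that enumerates every mask of a tiny 2x2 DropConnect layer, computes the exact model average, and compares it with the Gaussian moment-matching approximation:

```python
import itertools
import numpy as np

rng = np.random.default_rng(4)
p = 0.5
W = rng.standard_normal((2, 2))
v = rng.standard_normal(2)
relu = lambda u: np.maximum(u, 0)

# Exact model average: enumerate all 2^|M| = 16 masks, weighted by p(M)
r_exact = np.zeros(2)
for bits in itertools.product([0, 1], repeat=W.size):
    M = np.array(bits).reshape(W.shape)
    prob = p ** M.sum() * (1 - p) ** (M.size - M.sum())
    r_exact += prob * relu((M * W) @ v)

# Gaussian moment-matching approximation with sampling
mean = p * (W @ v)
var = p * (1 - p) * ((W * W) @ (v * v))
u_samples = rng.normal(mean, np.sqrt(var), size=(100_000, 2))
r_approx = relu(u_samples).mean(axis=0)

print(r_exact, r_approx)   # the two estimates should be close
```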

Experiments Overview
Datasets: MNIST, CIFAR-10, and SVHN.
Keep rate p = 0.5 for both the DropOut and the DropConnect network (except where explicitly stated).
The Normal, DropOut and DropConnect networks use exactly the same architecture, exactly the same model/training parameters, and exactly the same data-augmentation algorithm.
Data augmentation: translation, rotation and scaling.
Aggregating multiple models: 5 or more.

Varying the Size of the Network
MNIST test error with a two-hidden-layer network (p = 0.5).
[Figure: % test error vs. number of hidden units (200, 400, 800, 1600) for No-Drop, DropOut and DropConnect]
The No-Drop network overfits with a large number of neurons.
Both DropConnect and DropOut regularize the model nicely.
DropConnect performs better than both DropOut and No-Drop.

Varying the Keep Rate p
MNIST test error with a two-hidden-layer network with 400 neurons per layer.
[Figure: % test error vs. % of elements kept, for DropOut (mean), DropConnect (mean), DropOut (sampling) and DropConnect (sampling)]
The optimal value of p is roughly 0.5 for both DropOut and DropConnect.
DropConnect: sampling inference works better than mean inference.

Comparison of Convergence Rates
MNIST test error with a two-hidden-layer network with 400 neurons per layer.
[Figure: train and test cross-entropy vs. epoch for No-Drop, DropOut and DropConnect]
Cross-entropy is a continuous measure of error.
DropConnect converges more slowly than DropOut but reaches a better test error.

MNIST Results (1)
MNIST 784-800-800-10 network classification error rate, without data augmentation:

neuron   model        error (%)      5-network voting error (%)
relu     No-Drop      1.62 ± 0.037   1.40
relu     Dropout      1.28 ± 0.040   1.20
relu     DropConnect  1.20 ± 0.034   1.12
sigmoid  No-Drop      1.78 ± 0.037   1.74
sigmoid  Dropout      1.38 ± 0.039   1.36
sigmoid  DropConnect  1.55 ± 0.046   1.48
tanh     No-Drop      1.65 ± 0.026   1.49
tanh     Dropout      1.58 ± 0.053   1.55
tanh     DropConnect  1.36 ± 0.054   1.35

The relu neuron always performs better than the other neurons.
Error rate: DropConnect < DropOut < No-Drop for relu and tanh.
DropOut works best for sigmoid, which does not satisfy a(0) = 0.

MNIST Results (2)
MNIST classification error with different data augmentations:

crop  rotation/scaling  model        error (%)      5-network voting error (%)
no    no                No-Drop      0.77 ± 0.051   0.67
no    no                Dropout      0.59 ± 0.039   0.52
no    no                DropConnect  0.63 ± 0.035   0.57
yes   no                No-Drop      0.50 ± 0.098   0.38
yes   no                Dropout      0.39 ± 0.039   0.35
yes   no                DropConnect  0.39 ± 0.047   0.32
yes   yes               No-Drop      0.30 ± 0.035   0.21
yes   yes               Dropout      0.28 ± 0.016   0.27
yes   yes               DropConnect  0.28 ± 0.032   0.21

0.21% is the new state of the art.
Previous state of the art: 0.45% for a single model without elastic distortions [Goodfellow et al. 2013], and 0.23% with elastic distortions and voting [Ciresan et al. 2012].

CIFAR-10 Results
10 classes of 32×32 color images; 50K training images and 10K test images.

model        error (%)      5-network voting error (%)
No-Drop      11.18 ± 0.13   10.22
Dropout      11.52 ± 0.18    9.83
DropConnect  11.10 ± 0.13    9.41

Voting with 12 DropConnect networks produces a new state of the art of 9.32%.
Previous state of the art: 11.21% [Ciresan et al. 2012], 9.5% [Snoek et al. 2012], 9.38% [Goodfellow et al. 2013].

SVHN Results
10 classes of 32×32 color images; 604,388 training images (training set plus extra set) and 26,032 test images; large variety of colors and brightness.

model        error (%)       5-network voting error (%)
No-Drop      2.26 ± 0.072    1.94
Dropout      2.25 ± 0.034    1.96
DropConnect  2.23 ± 0.039    1.94

1.94% is the new state of the art.
Previous state of the art: 2.8% with stochastic pooling [Zeiler et al. 2013], 2.47% with a maxout network [Goodfellow et al. 2013].

How to Implement a DropConnect Layer

Implementation (mask weight)      fprop   bprop acts   bprop weights   total (ms)   Speedup
CPU float                         480.2   1228.6       1692.8          3401.6         1.0×
CPU bit                           392.3    679.1        759.7          1831.1         1.9×
GPU float (global memory)          21.6      6.2          7.2            35.0        97.2×
GPU float (tex1D memory)           15.1      6.1          6.0            27.2       126.0×
GPU bit (tex2D aligned memory)      2.4      2.7          3.1             8.2       414.8×

Timings (in ms) on an NVIDIA GTX 580 GPU, relative to a 2.67 GHz Intel Xeon (compiled with the -O3 flag); input and output dimension 1024, mini-batch size 128.
Tricks: encode the connection information with bits, and bind the mask weight matrix to 2D texture memory.
CUDA code available at http://cs.nyu.edu/~wanli/dropc/
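As a rough illustration of the bit-encoding trick (a sketch only; the authors' implementation is in CUDA with texture memory, which this does not reproduce), a NumPy example that packs a Boolean DropConnect mask into bits, shrinking it 8x before unpacking and applying it:

```python
import numpy as np

rng = np.random.default_rng(5)
p = 0.5
W = rng.standard_normal((1024, 1024)).astype(np.float32)
v = rng.standard_normal(1024).astype(np.float32)

# Sample a Boolean mask and pack it: 1 bit per connection instead of 1 byte
M_bool = rng.random(W.shape) < p
M_bits = np.packbits(M_bool, axis=1)                 # shape (1024, 128)
print(M_bool.nbytes, M_bits.nbytes)                  # 1048576 vs 131072 bytes

# Unpack on demand and apply the mask to the weights
M = np.unpackbits(M_bits, axis=1, count=W.shape[1]).astype(np.float32)
u = (M * W) @ v
```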

Conclusion
We introduced the DropConnect network:
A simple stochastic regularization algorithm for neural networks.
A generalization of DropOut.
Applied only to fully-connected layers.
Sets a new state of the art on three popular datasets.