Department of Automation, Association of Science and Technology of Automation. March 20, 2016
Contents
Binary Classification. Figure 1: a cat? Figure 2: a dog? Binary classification: given input data $x$ (e.g. a picture), the output of a binary classifier $y = f(x)$ is one label drawn from a set of two labels, $y \in \{\pm 1\}$.
Linear Classifier. (Figure: linearly separable data in the $(x_1, x_2)$ plane, split by the hyperplane $w^T x = b$.) Data set $D = \{(x_1^{(1)}, x_2^{(1)}), \dots, (x_1^{(n)}, x_2^{(n)})\}$. A linear binary classifier is a hyperplane $w^T x = b$: $f(x) = \mathrm{sgn}(w^T x - b)$.
Evaluation of a Linear Classifier. (Figure: the same data with the separating hyperplane $w^T x = b$.) True Positive: $y = +1$, $f(x) = +1$. True Negative: $y = -1$, $f(x) = -1$. False Positive: $y = -1$, $f(x) = +1$. False Negative: $y = +1$, $f(x) = -1$. Accuracy: $\frac{TP + TN}{n}$. Error rate: $\frac{FP + FN}{n}$. A good classifier minimizes the error rate.
Basic Concepts. (Figure: training data and a classifier $w^T x = b$.) Training set, test set, training error, generalization error, overfitting, loss function. (Diagram: the loss function compares the classifier output with the training data, giving the training error.)
Overfitting. (Figures: an over-complicated decision boundary in the $(x_1, x_2)$ plane, and error rate vs. iteration, where the training error keeps decreasing while the generalization error eventually rises.)
Perceptron. (Diagram: inputs $x_0 = 1, x_1, \dots, x_n$ with weights $w_0, w_1, \dots, w_n$ feeding the sum $\sum_{i=0}^{n} w_i x_i$.)
$$y = \begin{cases} +1, & \text{if } w^T x > 0 \\ -1, & \text{otherwise} \end{cases}$$
Perceptron. (Diagram: the linear unit computes $o(x_0, x_1, \dots, x_n) = \sum_{i=0}^{n} w_i x_i$ from inputs $x_0 = 1, x_1, \dots, x_n$ and weights $w_0, w_1, \dots, w_n$.)
Training Algorithm. Define a loss function: $E(w) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$. How to reduce the loss? (Figure: the loss surface $E$ over the weight space.)
Gradient Descent. The gradient w.r.t. $w$ is $\nabla E(w) = \left(\frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \dots, \frac{\partial E}{\partial w_n}\right)^T$, where $\frac{\partial E}{\partial w_i} = \sum_{d \in D} (t_d - o_d)(-x_i^{(d)})$. For every iteration ($\alpha$ denotes the learning rate): $w_i \leftarrow w_i + \Delta w_i$, with $\Delta w_i = -\alpha \frac{\partial E}{\partial w_i} = \alpha \sum_{d \in D} (t_d - o_d) x_i^{(d)}$, $i \in [n]$.
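A minimal NumPy sketch of this batch gradient-descent update for a linear unit (not from the original slides; the arrays X and t, and the default learning rate, are illustrative assumptions, and X is assumed to already contain the constant column x_0 = 1):

import numpy as np

def train_linear_unit(X, t, alpha=0.01, epochs=100):
    """Batch gradient descent on E(w) = 1/2 * sum_d (t_d - o_d)^2.
    X: (n_samples, n_features) inputs with x_0 = 1 already included.
    t: (n_samples,) targets."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        o = X @ w                   # linear outputs o_d = w^T x^(d)
        grad = -(X.T @ (t - o))     # dE/dw_i = sum_d (t_d - o_d)(-x_i^(d))
        w += -alpha * grad          # w_i <- w_i + alpha * sum_d (t_d - o_d) x_i^(d)
    return w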
Artificial Neural Network. (Diagram: input layer $x^{(1)}_1, x^{(1)}_2$; hidden layer $x^{(2)}_1, x^{(2)}_2, x^{(2)}_3$; output layer $x^{(3)}_1$.) $h$ is a non-linear function, and $x^{l+1} = h((W^l)^T x^l)$.
Sigmoid Function. $h(x) = \frac{1}{1 + e^{-x}}$. (Figure: the sigmoid curve rising from 0 to 1.) Properties: 1. continuous and differentiable; 2. maps $(-\infty, +\infty)$ to $[0, 1]$; 3. nonlinear; 4. $h'(x)$ is easy to calculate: $h'(x) = h(x)(1 - h(x))$.
Back Propagation and Delta Rule. Please refer to this page. Mathematical model of an ANN: $x^l = f(u^l)$, $u^l = (W^{l-1})^T x^{l-1}$, where $l$ denotes the current layer, the output layer is designated as layer $L$, and the input layer is designated as layer 1. The function $f(\cdot)$ is a nonlinear function (e.g. sigmoid or hyperbolic tangent). Define the loss function as $E(x^L, t)$, where $x^L$ is the network output and $t$ is the target output.
Back Propagation and Delta Rule. Expand the loss function: $E(x^L, t) = E(f((W^{L-1})^T x^{L-1}), t)$. Using the chain rule, we can write the derivative w.r.t. $W^{L-1}$ as $\frac{\partial E}{\partial W^{L-1}} = x^{L-1} \left(f'(u^L) \circ \frac{\partial E}{\partial x^L}\right)^T$, where $\circ$ denotes elementwise multiplication. If we define $\delta^L = f'(u^L) \circ \frac{\partial E}{\partial x^L}$, we get $\frac{\partial E}{\partial W^{L-1}} = x^{L-1} (\delta^L)^T$.
Back Propagation and Delta Rule. The $\delta$ terms can be calculated recursively, which makes the remaining gradients easy to write: $\delta^l = f'(u^l) \circ (W^l \delta^{l+1})$ for $l = L-1, \dots, 2$, and $\frac{\partial E}{\partial W^l} = x^l (\delta^{l+1})^T$ for $l = L-2, \dots, 1$.
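A minimal NumPy sketch of these delta-rule recursions for a fully connected network with sigmoid units and a squared-error loss (the loss choice, the vector shapes, and the function names are assumptions for illustration, not part of the slides):

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def backprop(x1, t, Ws):
    """Gradients dE/dW^l for a net with x^{l+1} = f((W^l)^T x^l), sigmoid f,
    and E = 1/2 ||x^L - t||^2.  x1: input x^1;  t: target;  Ws = [W^1, ..., W^{L-1}].
    Returns [dE/dW^1, ..., dE/dW^{L-1}]."""
    xs = [x1]
    for W in Ws:                               # forward pass
        xs.append(sigmoid(W.T @ xs[-1]))
    # delta^L = f'(u^L) o dE/dx^L, and f'(u^L) = x^L (1 - x^L) for the sigmoid
    delta = xs[-1] * (1 - xs[-1]) * (xs[-1] - t)
    grads = [None] * len(Ws)
    for l in range(len(Ws) - 1, -1, -1):       # backward pass
        grads[l] = np.outer(xs[l], delta)      # dE/dW^l = x^l (delta^{l+1})^T
        if l > 0:
            delta = xs[l] * (1 - xs[l]) * (Ws[l] @ delta)  # delta^l = f'(u^l) o (W^l delta^{l+1})
    return grads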
Network Structure. Figure 3: structure of a convolutional neural network. Layer types: convolution layer, pooling layer (subsampling), fully connected layer (inner product), ReLU layer, softmax layer.
Convolution Layer. Kernel size $= 3 \times 3$: $g_{ij} = \sum_{s=i}^{i+2} \sum_{t=j}^{j+2} h_{st} k_{st}$.
Convolution Layer. Stride $= 1$: $g_{ij} = \sum_{s=i}^{i+2} \sum_{t=j}^{j+2} h_{st} k_{st}$.
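A minimal NumPy sketch of this operation (valid cross-correlation with stride 1; note that here the kernel is indexed from (0, 0), so the product is h[i+s, j+t] * k[s, t] -- an illustrative convention, not taken from the slide):

import numpy as np

def conv2d_valid(h, k):
    """g[i, j] = sum_{s, t} h[i+s, j+t] * k[s, t]  (stride 1, no padding)."""
    H, W = h.shape
    kh, kw = k.shape
    g = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(g.shape[0]):
        for j in range(g.shape[1]):
            g[i, j] = np.sum(h[i:i + kh, j:j + kw] * k)
    return g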
Convolution Layer. Why do we use convolution? It is inspired by feature extraction with different kernels; it rests on the assumption that pixels far from each other are (nearly) independent; and it reduces the number of adjustable network weights, which helps avoid overfitting.
Pooling Layer. $g_{ij} = \max\{h_{2i,2j},\; h_{2i+1,2j},\; h_{2i,2j+1},\; h_{2i+1,2j+1}\}$. There are no free parameters in the pooling layer.
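A minimal NumPy sketch of 2 x 2 max pooling matching the formula above (it assumes, for simplicity, that the input height and width are even):

import numpy as np

def max_pool_2x2(h):
    """g[i, j] = max{h[2i, 2j], h[2i+1, 2j], h[2i, 2j+1], h[2i+1, 2j+1]}."""
    H, W = h.shape
    g = np.zeros((H // 2, W // 2))
    for i in range(H // 2):
        for j in range(W // 2):
            g[i, j] = h[2 * i:2 * i + 2, 2 * j:2 * j + 2].max()
    return g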
Inner Product. Also known as the fully connected layer. A weight connects every input to every output, namely $y = W^T x$.
Rectified Linear Unit. The rectifier: $y = \max\{0, x\}$. A smooth approximation (the softplus): $y = \ln(1 + e^x)$, with derivative $\frac{dy}{dx} = \frac{1}{1 + e^{-x}}$. (Figure: the rectifier curve.) ReLU improves CNN performance.
Softmax Loss. Softmax: $y_i(x) = \frac{e^{x_i}}{\sum_{k=1}^{n} e^{x_k}}$, $i \in [n]$. It is derived from softmax regression, the extension of logistic regression to multinomial classification. The outputs of the softmax layer are the probabilities of each label. Softmax loss function: let label $j$ be the ground truth; then $L = -\ln(y_j(x)) = -\ln\left(\frac{e^{x_j}}{\sum_{k=1}^{n} e^{x_k}}\right) = \ln\left(\sum_{k=1}^{n} e^{x_k}\right) - x_j$ and $\frac{\partial L}{\partial x_i} = y_i(x) - \delta_{ij}$, where $\delta_{ij} = 1$ iff $i = j$, and $\delta_{ij} = 0$ otherwise.
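A minimal NumPy sketch of the softmax loss and its gradient (the shift by max(x) is a standard numerical-stability trick added here, not part of the slide):

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))          # subtract max(x) for numerical stability
    return e / e.sum()

def softmax_loss_and_grad(x, j):
    """L = -ln y_j(x);  dL/dx_i = y_i(x) - delta_ij."""
    y = softmax(x)
    loss = -np.log(y[j])
    grad = y.copy()
    grad[j] -= 1.0
    return loss, grad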
MNIST Database. MNIST: Mixed National Institute of Standards and Technology. Figure 4: handwritten digits. There are 10 distinct classes (the digits 0 through 9).
LeNet Review. Figure 5: LeNet for MNIST. input: a picture (size $28 \times 28$); conv1: 4 kernels (size $5 \times 5$); pool1: max pooling (size $2 \times 2$); conv2: 3 kernels (size $5 \times 5$); pool2: max pooling (size $2 \times 2$); ip: fully connected ($192 \times 10$); softmax: 10 inputs, 10 probability outputs.
Stochastic gradient descent, early stopping, weight decay, rectified linear units, fine-tuning, overlapping pooling, dropout.
Stochastic Gradient Descent. Overall loss function: $E(w) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$. Batch loss function: $E_B(w) = \frac{1}{2} \sum_{d \in B} (t_d - o_d)^2$, where $B \subset D$ and $|B| \ll |D|$, e.g. 256 samples out of 100,000 samples. Stochastic gradient: $\frac{\partial E_B}{\partial w}$.
Stochastic Gradient Descent. SGD algorithm ($\alpha$: learning rate, $\mu$: momentum): $w^{t+1} = w^t + \Delta w^{t+1}$, $\Delta w^{t+1} = \mu \Delta w^t - \alpha \frac{\partial L}{\partial w^t}$. Advantages of SGD: it dramatically reduces the computation per update, and it helps escape poor local optima.
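A minimal NumPy sketch of this momentum update; grad_fn (the batch gradient of the loss), the data array, and the default hyperparameters are assumptions for illustration only:

import numpy as np

def sgd_momentum(w, grad_fn, data, alpha=0.01, mu=0.9, batch_size=256, iters=1000):
    """w^{t+1} = w^t + dw^{t+1},  dw^{t+1} = mu * dw^t - alpha * dL/dw^t."""
    dw = np.zeros_like(w)
    for _ in range(iters):
        batch = data[np.random.choice(len(data), batch_size, replace=False)]
        dw = mu * dw - alpha * grad_fn(w, batch)   # momentum update on a mini-batch
        w = w + dw
    return w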
Early Stopping. (Figure: error rate vs. iteration, showing the training error, the generalization error, and the point where training is stopped.) Stop before the generalization error rises.
Weight Decay. Regularization: $E(w) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2 + \lambda w^T w$. In the algorithm: $\Delta w^{t+1} = \mu \Delta w^t - \alpha \frac{\partial L}{\partial w^t} - \alpha \lambda w^t$.
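Extending the SGD sketch above with the weight-decay term (lam stands for the regularization coefficient lambda; its default value here is only an illustrative choice):

def sgd_weight_decay_step(w, dw, grad, alpha=0.01, mu=0.9, lam=0.0005):
    """One step: dw^{t+1} = mu*dw^t - alpha*dL/dw^t - alpha*lambda*w^t;  w^{t+1} = w^t + dw^{t+1}."""
    dw = mu * dw - alpha * grad - alpha * lam * w
    return w + dw, dw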
Rectified Linear Unit. (Figure: the rectifier $y = \max\{0, x\}$.) In terms of training time with gradient descent, saturating nonlinearities such as the sigmoid and tanh are much slower than non-saturating nonlinearities.
Dropout Overfitting can be reduced by using dropout to prevent complex co-adaptations on the training data. On each presentation of each training case, each hidden unit is randomly omitted from the network with a probability of 0.5, so a hidden unit cannot rely on other hidden units being present.
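A minimal NumPy sketch of this idea (training-time masking with drop probability 0.5; scaling the activations by the keep probability at test time is the convention used here, an assumption rather than something stated on the slide):

import numpy as np

def dropout(h, p_drop=0.5, train=True):
    """Randomly omit each hidden unit with probability p_drop during training."""
    if not train:
        return h * (1.0 - p_drop)              # scale at test time to match expected activations
    mask = (np.random.rand(*h.shape) >= p_drop)
    return h * mask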
Network in Network. Network in network is a new concept used in GoogLeNet, the champion of ILSVRC (ImageNet Large Scale Visual Recognition Challenge) 2014. The key structure of network in network is called the Inception module. (Figure: the whole network structure.)
Network in Network. Motivations for this structure: deeper networks tend to have stronger learning ability, especially on large-scale datasets; the curse of dimensionality in neural networks leads to an exponential expansion of parameters; and a very sparse deep neural network can be constructed layer by layer by analyzing correlation statistics. The Inception module is an attempt to approximate such a sparse structure. Details of the Inception module: convolutions at various scales are needed to capture both small and large features (inspired by Gabor filters), and $1 \times 1$ convolutions are used for dimensionality reduction.
Network in Network. (Table: the results of ILSVRC 2014.)
Residual Network. A degradation problem is encountered as networks go deeper:
Residual Network. An intuitive way to construct a deep model from a shallow model: copy the trained shallow model ($m$ layers) and stack identity-mapping layers on top to reach $n$ layers ($n \geq m$). In practice, however, such a deeper model is unable to reach a better solution than the original model.
Residual Network. Instead, learn a residual ($F(x) = H(x) - x$). This finally reached depths of up to 152 layers.
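A minimal NumPy sketch of the idea: the block outputs $H(x) = F(x) + x$, so the stacked layers only have to learn the residual $F(x)$. The two-weight-matrix form of $F$ and the placement of the ReLUs are illustrative assumptions, not the exact block from the paper:

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, W2):
    """H(x) = relu(F(x) + x) with residual branch F(x) = W2^T relu(W1^T x)."""
    F = W2.T @ relu(W1.T @ x)   # residual branch
    return relu(F + x)          # shortcut (identity) connection added back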
Residual Network. (Table: the results of ILSVRC 2015.)
BNN: Binarized Neural Networks. Deterministic or stochastic binarization:
$$x^b = \mathrm{sgn}(x) = \begin{cases} +1 & \text{if } x \geq 0, \\ -1 & \text{otherwise;} \end{cases} \qquad x^b = \begin{cases} +1 & \text{with probability } p = \sigma(x), \\ -1 & \text{with probability } 1 - p. \end{cases}$$
(Figure: the curve of $\sigma(x)$ over $x \in [-1, 1]$, rising from 0 to 1.)
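A minimal NumPy sketch of both binarization rules; sigma is taken here to be the "hard sigmoid" clip((x + 1)/2, 0, 1), an assumption recovered from the plot rather than stated in the text:

import numpy as np

def binarize_deterministic(x):
    """x^b = sgn(x): +1 if x >= 0, -1 otherwise."""
    return np.where(x >= 0, 1.0, -1.0)

def binarize_stochastic(x):
    """x^b = +1 with probability p = sigma(x), -1 with probability 1 - p."""
    p = np.clip((x + 1.0) / 2.0, 0.0, 1.0)   # assumed "hard sigmoid"
    return np.where(np.random.rand(*x.shape) < p, 1.0, -1.0)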
BNN: Binarized Neural Networks. Binarized convolution kernels: only 42% of the filters are unique on the CIFAR-10 ConvNet.
BNN: Binarized Neural Networks. 1. As accurate as 32-bit DNNs.
BNN: Binarized Neural Networks. 2. Lower computational consumption.
CNN Visualization. Use a deconvolutional network to visualize a higher layer's feature maps. Three key points are mentioned: unpooling, an approximate inverse of max pooling obtained by recording the locations of the maxima; rectification, passing the reconstructed signals through a ReLU; and filtering, using transposed versions of the same filters.
CNN Visualization. Different visualization results: feature maps in each layer, and the evolution of features during training.
Further Developments. Long Short-Term Memory (LSTM); region-based CNNs (R-CNN, Fast R-CNN, Faster R-CNN); Deep Belief Networks (Restricted Boltzmann Machines); AlphaGo and reinforcement learning; CNNs for speech recognition.
Caffe Tutorial. For more information please refer to this page. Key words: nets, layers and blobs; forward / backward; loss; solver; layer catalogue; interfaces; data.
Nets, Layers and Blobs. (Diagram: a softmax regression example, illustrating the Layer, Net, and Blob abstractions.)
Forward / Backward. (Diagram: the forward pass computes $w^T x$, then $f(w^T x)$, then the loss $L(f(w^T x))$; the backward pass propagates the gradients of $L$ with respect to $f$, $w$, and $x$ back through the same layers.)
Loss. Softmax: $y_i(x) = \frac{e^{x_i}}{\sum_{k=1}^{n} e^{x_k}}$, $i \in [n]$. Softmax loss function: let label $j$ be the ground truth; then $L = -\ln(y_j(x)) = -\ln\left(\frac{e^{x_j}}{\sum_{k=1}^{n} e^{x_k}}\right) = \ln\left(\sum_{k=1}^{n} e^{x_k}\right) - x_j$ and $\frac{\partial L}{\partial x_i} = y_i(x) - \delta_{ij}$, where $\delta_{ij} = 1$ iff $i = j$, and $\delta_{ij} = 0$ otherwise.
Solver. SGD (Stochastic Gradient Descent), with $\alpha$ the learning rate and $\mu$ the momentum: $w^{t+1} = w^t + \Delta w^{t+1}$, $\Delta w^{t+1} = \mu \Delta w^t - \alpha \frac{\partial L}{\partial w^t}$.
Solver. Example solver parameters: base learning rate $\alpha = 0.01$; learning rate policy: step (reduce the learning rate according to the step size); step size: 100000; gamma: 0.1 (multiply the learning rate by a factor of 0.1 every step-size iterations); momentum $\mu = 0.9$; max iteration: 350000 (stop at iteration 350000).
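A small sketch of how the "step" policy above behaves (the learning rate is multiplied by gamma every stepsize iterations); the default values mirror the example parameters:

def step_lr(iteration, base_lr=0.01, gamma=0.1, stepsize=100000):
    """Caffe-style 'step' policy: lr = base_lr * gamma^(floor(iteration / stepsize))."""
    return base_lr * (gamma ** (iteration // stepsize))

# e.g. step_lr(0) == 0.01, step_lr(100000) == 0.001, step_lr(200000) == 0.0001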
Layer Catalogue. Please refer to this page. Vision layers: convolution, pooling. Loss layers: softmax loss, Euclidean loss, cross-entropy. Activation layers: sigmoid, ReLU, hyperbolic tangent.
Layer Catalogue. Data layers: database, in-memory, HDF5 input, HDF5 output. Common layers: inner product, splitting, flattening, reshape, concatenation.
Installation. Prerequisites: protobuf, CUDA, OpenBLAS, Boost, OpenCV, lmdb, leveldb, cuDNN (optional), Python (optional), numpy (optional), MATLAB (optional). Install: git clone git://github.com/bvlc/caffe /your/own/caffe/folder; go to the Caffe root folder; cp Makefile.config.example Makefile.config; make all; make test; make runtest. Hardware: K40, K20, or Titan for ImageNet-scale training; GTX-series cards or a GPU-equipped MacBook Pro for small datasets.
LeNet. LeNet structure: 1. protobuf protocol; 2. run!
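As a hedged sketch of step 2, training can also be driven from Caffe's Python interface; the prototxt and caffemodel paths below are placeholders assuming the standard MNIST example shipped with Caffe:

import caffe

caffe.set_mode_gpu()                                   # or caffe.set_mode_cpu()
solver = caffe.SGDSolver('examples/mnist/lenet_solver.prototxt')
solver.solve()                                         # run training to max_iter

# Load the trained weights for testing
net = caffe.Net('examples/mnist/lenet.prototxt',
                'examples/mnist/lenet_iter_10000.caffemodel', caffe.TEST)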
How to Become Professional? 1. Figure out the theoretical key points (read papers). 2. Read the Caffe source code. 3. Be proficient at programming and debugging. 4. Take advantage of search engines and the community. 5. Follow this pipeline: experiment design, data preparation (build the database with the provided tools), model selection (including network and solver), training, analysis and comparison.
Reference I Ossama Abdel-Hamid, Li Deng, and Dong Yu. Exploring convolutional neural network structures and optimization techniques for speech recognition. In INTERSPEECH, pages 3366-3370, 2013. Christopher M Bishop. Pattern Recognition and Machine Learning. Springer, 2006. Ross Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440-1448, 2015. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
Reference II Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997. Itay Hubara, Daniel Soudry, and Ran El-Yaniv. Binarized neural networks. arXiv preprint arXiv:1602.02505, 2016. Herbert Jaeger. Tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the echo state network approach. GMD-Forschungszentrum Informationstechnik, 2002.
Reference III Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM International Conference on Multimedia, pages 675-678. ACM, 2014. Abdel-rahman Mohamed, George E Dahl, and Geoffrey Hinton. Acoustic modeling using deep belief networks. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):14-22, 2012.
Reference IV Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91-99, 2015. David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484-489, 2016.
Reference V Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1-9, 2015. Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision - ECCV 2014, pages 818-833. Springer, 2014.