HE Shuncheng (hsc12@outlook.com), March 20, 2016


1 Department of Automation, Association of Science and Technology of Automation. March 20, 2016

2 Contents

3

4 Binary Classification. Figure 1: a cat? Figure 2: a dog? Binary classification: given input data $x$ (e.g. a picture), the output of a binary classifier $y = f(x)$ is one label from a set of two labels, $y \in \{\pm 1\}$.

5 Linear Classifier. [Figure: data points in the $(x_1, x_2)$ plane separated by the hyperplane $w^T x = b$.] Data set $D = \{(x_1^{(1)}, x_2^{(1)}), \ldots, (x_1^{(n)}, x_2^{(n)})\}$. A linear binary classifier is a hyperplane $w^T x = b$:
$$f(x) = \mathrm{sgn}(w^T x - b)$$
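To make this concrete, here is a minimal numpy sketch of such a classifier (not from the slides; the data, $w$ and $b$ below are made up for illustration):

import numpy as np

def linear_classify(X, w, b):
    # Binary linear classifier: f(x) = sgn(w^T x - b)
    return np.where(X @ w - b > 0, 1, -1)

# Hypothetical 2-D data set and separating hyperplane
X = np.array([[2.0, 3.0], [0.5, 0.2], [3.0, 1.5], [0.1, 1.0]])
w = np.array([1.0, 1.0])  # normal vector of the hyperplane
b = 2.5                   # offset
print(linear_classify(X, w, b))  # [ 1 -1  1 -1]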

6 Evaluation of a Linear Classifier. [Figure: positive and negative samples around the hyperplane $w^T x = b$.] True Positive: $y = +1$, $f(x) = +1$. True Negative: $y = -1$, $f(x) = -1$. False Positive: $y = -1$, $f(x) = +1$. False Negative: $y = +1$, $f(x) = -1$. Accuracy: $\frac{TP + TN}{n}$. Error rate: $\frac{FP + FN}{n}$. A good classifier minimizes the error rate.

7 Basic Concepts: Training Set, Test Set, Training Error, Generalization Error, Overfitting, Loss Function. [Figure: the loss function compares the output of the classifier $w^T x = b$ on the training data with the labels, giving the training error.]

8 Overfitting. [Figure left: an overfitted decision boundary in the $(x_1, x_2)$ plane. Figure right: error rate vs. iteration; the training error keeps falling while the generalization error rises.]

9

10 Perceptron. [Figure: inputs $x_0 = 1, x_1, \ldots, x_n$ with weights $w_0, w_1, \ldots, w_n$ feeding the sum $\sum_{i=0}^{n} w_i x_i$.]
$$y = \begin{cases} +1, & \text{if } w^T x > 0 \\ -1, & \text{otherwise} \end{cases}$$

11 Perceptron. [Figure: the same unit without the threshold, used as a linear unit for training:]
$$o(x_0, x_1, \ldots, x_n) = \sum_{i=0}^{n} w_i x_i$$

12 Training Algorithm. Define a loss function:
$$E(w) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$$
How to reduce the loss? [Figure: the loss surface.]

13 Gradient Descent. The gradient w.r.t. $w$:
$$\nabla E(w) = \left( \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n} \right)^T$$
where
$$\frac{\partial E}{\partial w_i} = \sum_{d \in D} (t_d - o_d)(-x_i^{(d)})$$
For every iteration ($\alpha$ denotes the learning rate): $w_i \leftarrow w_i + \Delta w_i$, where
$$\Delta w_i = -\alpha \frac{\partial E}{\partial w_i} = \alpha \sum_{d \in D} (t_d - o_d) x_i^{(d)}, \quad i \in [n]$$
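Not from the slides, but a minimal sketch of this batch update rule for a linear unit (the toy data are made up):

import numpy as np

def train_linear_unit(X, t, alpha=0.05, iters=100):
    # Batch gradient descent on E(w) = 1/2 * sum_d (t_d - o_d)^2
    # for a linear unit o = w^T x; x_0 = 1 is the bias input.
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        o = X @ w                   # outputs o_d for all training samples
        w += alpha * X.T @ (t - o)  # w_i += alpha * sum_d (t_d - o_d) x_i^(d)
    return w

# Toy data with the bias input x_0 = 1 prepended
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
t = np.array([1.0, 3.0, 5.0])       # generated by w = (1, 2)
print(train_linear_unit(X, t))      # approaches [1. 2.]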

14 Artificial Neural Network. [Figure: input layer $x_1^{(1)}, x_2^{(1)}$; hidden layer $x_1^{(2)}, x_2^{(2)}, x_3^{(2)}$; output layer $x_1^{(3)}$.] $h$ is a non-linear function:
$$x^{l+1} = h((W^l)^T x^l)$$

15 Sigmoid Function.
$$h(x) = \frac{1}{1 + e^{-x}}$$
1. continuous, differentiable; 2. maps $(-\infty, +\infty)$ to $(0, 1)$; 3. nonlinearity; 4. $h'(x)$ is easy to calculate: $h'(x) = h(x)(1 - h(x))$. [Figure: the sigmoid curve.]

16 Back Propagation and Delta Rule. Please refer to this page. Mathematical model of an ANN:
$$x^l = f(u^l), \quad u^l = (W^{l-1})^T x^{l-1}$$
where $l$ denotes the current layer, with the output layer designated layer $L$ and the input layer designated layer 1. The function $f(\cdot)$ is a nonlinear function (e.g. sigmoid or hyperbolic tangent). Define the loss function as $E(x^L, t)$, where $x^L$ is the network output and $t$ is the target output.

17 Back Propagation and Delta Rule. Expand the loss function:
$$E(x^L, t) = E(f((W^{L-1})^T x^{L-1}), t)$$
Using the chain rule, we can write the derivatives w.r.t. $W^{L-1}$:
$$\frac{\partial E}{\partial W^{L-1}} = x^{L-1} \left( f'(u^L) \circ \frac{\partial E}{\partial x^L} \right)^T$$
where $\circ$ denotes elementwise multiplication, and if we define
$$\delta^L = f'(u^L) \circ \frac{\partial E}{\partial x^L}$$
we get
$$\frac{\partial E}{\partial W^{L-1}} = x^{L-1} (\delta^L)^T$$

18 Back Propagation and Delta Rule. If we calculate the $\delta$ term recursively, it is easy to write
$$\delta^l = f'(u^l) \circ (W^l \delta^{l+1}), \quad l = L-1, \ldots, 2$$
$$\frac{\partial E}{\partial W^l} = x^l (\delta^{l+1})^T, \quad l = L-2, \ldots, 1$$
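Not on the slides, but here is a minimal numpy sketch of these recursions for a fully-connected sigmoid network with squared-error loss (function names and the toy network are mine):

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def backprop_step(x, t, Ws, alpha=0.1):
    # Forward pass: u^{l+1} = (W^l)^T x^l, x^{l+1} = f(u^{l+1})
    xs = [x]
    for W in Ws:
        xs.append(sigmoid(W.T @ xs[-1]))
    # Output delta: delta^L = f'(u^L) o dE/dx^L, with E = 1/2 ||x^L - t||^2
    # and f'(u) = f(u)(1 - f(u)) = x(1 - x) for the sigmoid
    delta = xs[-1] * (1 - xs[-1]) * (xs[-1] - t)
    for l in reversed(range(len(Ws))):
        grad = np.outer(xs[l], delta)                   # dE/dW^l = x^l (delta^{l+1})^T
        delta = xs[l] * (1 - xs[l]) * (Ws[l] @ delta)   # delta^l = f'(u^l) o (W^l delta^{l+1})
        Ws[l] = Ws[l] - alpha * grad
    return Ws

# Toy usage: a 2-3-1 network trained toward a single target
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((2, 3)), rng.standard_normal((3, 1))]
for _ in range(1000):
    Ws = backprop_step(np.array([0.5, -0.2]), np.array([0.8]), Ws)

Here Ws is the list of weight matrices $W^1, \ldots, W^{L-1}$; each step performs one full forward and backward pass.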

19

20 Network Structure. Figure 3: structure of a convolutional neural network. Convolution Layer; Pooling Layer (Subsampling); Fully-connected Layer (Inner Product); ReLU Layer; Softmax Layer.

21 Convolution Layer. kernel size = 3×3:
$$g_{ij} = \sum_{s=0}^{2} \sum_{t=0}^{2} h_{i+s,\,j+t}\, k_{st}$$

22 Convolution Layer. stride = 1:
$$g_{ij} = \sum_{s=0}^{2} \sum_{t=0}^{2} h_{i+s,\,j+t}\, k_{st}$$
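A direct numpy translation of this formula, with padding omitted and the stride exposed as a parameter (function and variable names are mine):

import numpy as np

def conv2d(h, k, stride=1):
    # g_{ij} = sum_{s,t} h_{i+s, j+t} * k_{s,t}  (valid, no padding)
    kh, kw = k.shape
    H, W = h.shape
    out_h = (H - kh) // stride + 1
    out_w = (W - kw) // stride + 1
    g = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = h[i*stride:i*stride+kh, j*stride:j*stride+kw]
            g[i, j] = np.sum(window * k)
    return g

h = np.arange(25, dtype=float).reshape(5, 5)   # a 5x5 feature map
k = np.ones((3, 3)) / 9.0                      # a 3x3 averaging kernel
print(conv2d(h, k, stride=1).shape)            # (3, 3)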

23 Convolution Layer. Why choose convolution? Inspired by feature extraction: different kernels extract different features. Built on the assumption that pixels far from each other are independent. Reduces the number of adjustable network weights, which helps avoid overfitting.

24 Pooling Layer.
$$g_{ij} = \max\{h_{2i,2j},\, h_{2i+1,2j},\, h_{2i,2j+1},\, h_{2i+1,2j+1}\}$$
There are no free parameters in the pooling layer.
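A minimal numpy sketch of 2×2 max pooling matching this formula (my own illustration):

import numpy as np

def max_pool_2x2(h):
    # g_{ij} = max of the 2x2 block {h_{2i,2j}, h_{2i+1,2j}, h_{2i,2j+1}, h_{2i+1,2j+1}}
    H, W = h.shape
    g = h[:H - H % 2, :W - W % 2]   # drop an odd trailing row/column if present
    return g.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

h = np.array([[1., 2., 3., 4.],
              [5., 6., 7., 8.],
              [9., 1., 2., 3.],
              [4., 5., 6., 7.]])
print(max_pool_2x2(h))   # [[6. 8.] [9. 7.]]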

26 Inner Product. Also known as the fully-connected layer. There is a weight from every input to every output, namely
$$y = W^T x$$

27 Rectified Linear Unit. A rectifier: $y = \max\{0, x\}$. Its smooth approximation (the softplus): $y = \ln(1 + e^x)$, with derivative w.r.t. $x$
$$\frac{dy}{dx} = \frac{e^x}{1 + e^x}$$
[Figure: the rectifier and softplus curves.] ReLU improves CNN performance.
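For illustration, the rectifier, the softplus, and the softplus derivative in a few numpy lines:

import numpy as np

x = np.linspace(-3, 3, 7)
relu      = np.maximum(0.0, x)             # y = max{0, x}
softplus  = np.log1p(np.exp(x))            # y = ln(1 + e^x), smooth approximation
dsoftplus = np.exp(x) / (1.0 + np.exp(x))  # dy/dx = e^x / (1 + e^x), the sigmoid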

28 Softmax Loss. Softmax:
$$y_i(x) = \frac{e^{x_i}}{\sum_{k=1}^{n} e^{x_k}}, \quad i \in [n]$$
Derived from softmax regression, an extension of logistic regression to multinomial classification. The outputs of the softmax layer are the probabilities of each label. Softmax loss function: let label $j$ be the ground truth; then
$$L = -\ln(y_j(x)) = -\ln\left(\frac{e^{x_j}}{\sum_{k=1}^{n} e^{x_k}}\right) = \ln\left(\sum_{k=1}^{n} e^{x_k}\right) - x_j$$
$$\frac{\partial L}{\partial x_i} = y_i(x) - \delta_{ij}$$
where $\delta_{ij} = 1$ iff $i = j$, and $\delta_{ij} = 0$ otherwise.
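A numerically stable sketch of these formulas (the max-subtraction trick and all names are my additions):

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))      # subtract the max for numerical stability
    return e / e.sum()

def softmax_loss_and_grad(x, j):
    # L = -ln(y_j(x));  dL/dx_i = y_i(x) - delta_ij
    y = softmax(x)
    loss = -np.log(y[j])
    grad = y.copy()
    grad[j] -= 1.0                 # subtract the ground-truth indicator
    return loss, grad

x = np.array([2.0, 1.0, 0.1])
print(softmax_loss_and_grad(x, j=0))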

29 MNIST Database. MNIST: Mixed National Institute of Standards and Technology. Figure 4: Handwritten Digits. 10 distinct classes.

30 LeNet Review. Figure 5: LeNet for MNIST. input: a picture (size 28×28); conv1: 4 kernels (size 5×5); pool1: max pooling (size 2×2); conv2: 3 kernels (size 5×5); pool2: max pooling (size 2×2); ip: fully-connected (192 → 10); softmax: 10 inputs, 10 probability outputs.

31

32 Stochastic gradient descent; early stopping; weight decay; rectified linear units; fine-tuning; overlapping pooling; dropout.

33 Stochastic Gradient Descent. Overall loss function:
$$E(w) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$$
Batch loss function:
$$E_B(w) = \frac{1}{2} \sum_{d \in B} (t_d - o_d)^2$$
where $B \subset D$ and $|B| \ll |D|$, e.g. a batch of 256 samples drawn from the full training set. Stochastic gradient: $\frac{\partial E_B}{\partial w}$.

34 Stochastic Gradient Descent. SGD algorithm ($\alpha$: learning rate, $\mu$: momentum):
$$w_{t+1} = w_t + \Delta w_{t+1}, \quad \Delta w_{t+1} = \mu \Delta w_t - \alpha \frac{\partial L}{\partial w_t}$$
Advantages of SGD: dramatically reduces the computing resources required; helps escape local optima.

35 Early Stop. [Figure: error rate vs. iteration; the training error keeps decreasing while the generalization error starts to rise; training is stopped at that point.] Stop before the generalization error rises.

36 Weight Decay. Regularization:
$$E(w) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2 + \lambda w^T w$$
In the algorithm:
$$\Delta w_{t+1} = \mu \Delta w_t - \alpha \frac{\partial L}{\partial w_t} - \alpha \lambda w_t$$
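Combining the momentum update of slide 34 with the weight-decay term above, a small sketch (the quadratic toy objective is my own illustration):

import numpy as np

def sgd_step(w, dw, grad, alpha=0.01, mu=0.9, lam=5e-4):
    # Delta w_{t+1} = mu * Delta w_t - alpha * dL/dw_t - alpha * lambda * w_t
    dw = mu * dw - alpha * grad - alpha * lam * w
    w = w + dw
    return w, dw

# Toy usage: minimize L(w) = 1/2 ||w||^2, whose gradient is w itself
w, dw = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(200):
    w, dw = sgd_step(w, dw, grad=w)
print(w)   # close to [0, 0]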

37 Rectified Linear Unit. [Figure: ReLU curve.] In terms of training time with gradient descent, saturating nonlinearities such as sigmoid and tanh are much slower than non-saturating nonlinearities.

38 Dropout Overfitting can be reduced by using dropout to prevent complex co-adaptations on the training data. On each presentation of each training case, each hidden unit is randomly omitted from the network with a probability of 0.5, so a hidden unit cannot rely on other hidden units being present.
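A sketch of this scheme; the 1/(1-p) rescaling ("inverted dropout", so no scaling is needed at test time) is a common detail I have added, not something stated on the slide:

import numpy as np

def dropout(h, p=0.5, train=True):
    # Randomly omit each hidden unit with probability p during training
    if not train:
        return h                    # at test time, use all units
    mask = (np.random.rand(*h.shape) >= p)
    return h * mask / (1.0 - p)     # rescale so the expected activation is unchanged

h = np.ones(8)
print(dropout(h))   # roughly half the entries zeroed, the rest scaled to 2.0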

39

40 Network in Network. Network in network is a key concept in GoogLeNet, the champion of ILSVRC 2014 (ImageNet Large Scale Visual Recognition Challenge). The key structure of network in network is called the Inception module. [Figure: the whole network structure.]

41 Network in Network. Motivations for choosing this structure: deeper networks tend to have stronger learning ability, especially when handling large-scale datasets; the curse of dimensionality in neural networks leads to an exponential expansion of parameters; a very sparse deep neural network can be constructed layer by layer by analyzing the correlation statistics; the Inception module is an attempt to approximate such a sparse structure. Details of the Inception module: convolutions at various scales are necessary for both small and large features (inspired by Gabor filters); 1×1 convolutions are used to reduce dimensionality.

42 Network in Network. The result of ILSVRC 2014: [table omitted].

43 Residual Network. Networks encounter the degradation problem as they go deeper: [figure omitted].

44 Residual Network. An intuitive way to construct a deep model from a shallow model: take the shallow model ($m$ layers) and append identity mappings until it has $n$ layers ($n > m$). [Figure: shallow model plus identity maps.] In practice, however, solvers are unable to reach a better solution than the original shallow model.

45 Residual Network. Learning a residual, $F(x) = H(x) - x$: [figure omitted]. ResNet finally reached depths of up to 152 layers.

46 Residual Network. The result of ILSVRC 2015: [table omitted].

47 BNN: Binarized Neural Networks. Deterministic or stochastic binarization:
$$x^b = \mathrm{sgn}(x) = \begin{cases} +1 & \text{if } x \ge 0, \\ -1 & \text{otherwise.} \end{cases}$$
$$x^b = \begin{cases} +1 & \text{with probability } p = \sigma(x), \\ -1 & \text{with probability } 1 - p. \end{cases}$$
[Figure: the hard sigmoid $\sigma(x)$, rising from 0 at $x = -1$ to 1 at $x = 1$.]
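A sketch of both schemes; following the cited BNN paper, I assume the "hard sigmoid" $\sigma(x) = \mathrm{clip}(\frac{x+1}{2}, 0, 1)$ for the stochastic case:

import numpy as np

def hard_sigmoid(x):
    # sigma(x) = clip((x + 1) / 2, 0, 1), as in the BNN paper
    return np.clip((x + 1.0) / 2.0, 0.0, 1.0)

def binarize(x, stochastic=False):
    # Deterministic: x^b = sgn(x). Stochastic: +1 with probability sigma(x).
    if not stochastic:
        return np.where(x >= 0, 1.0, -1.0)
    p = hard_sigmoid(x)
    return np.where(np.random.rand(*x.shape) < p, 1.0, -1.0)

x = np.array([-1.5, -0.2, 0.0, 0.7])
print(binarize(x))                    # [-1. -1.  1.  1.]
print(binarize(x, stochastic=True))   # random signs, biased by sigma(x)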

48 BNN: Binarized Neural Networks. Binarized convolution kernels: only 42% of the filters are unique in the CIFAR-10 ConvNet.

49 BNN: Binarized Neural Networks: 1. As accurate as 32-bit DNNs.

50 BNN: Binarized Neural Networks: 2. Less computational consumption.

51 CNN Visualization. Use a deconvolutional network to visualize the feature maps of higher layers. Three key points are mentioned: Unpooling: approximate inverse of max pooling, achieved by recording the locations of the maxima. Rectification: pass the reconstructed signals through ReLU. Filtering: use transposed versions of the same filters.

52 CNN Visualization. Different results of visualization: feature maps in each layer; the evolution of features during training.

53 More Tremendous Developments. Long Short-Term Memory (LSTM); Region-based CNN (R-CNN, Fast R-CNN, Faster R-CNN); Deep Belief Networks (Restricted Boltzmann Machines); AlphaGo and Reinforcement Learning; CNN for Speech Recognition.

54

55 Caffe Tutorial. For more information please refer to this page. Key words: Nets, Layers and Blobs; Forward / Backward; Loss; Solver; Layer Catalogue; Interfaces; Data.

56 Nets, Layers and Blobs. A Blob holds data; a Layer computes on Blobs; a Net is a set of connected Layers. [Figure: example net for Softmax Regression.]

57 Forward / Backward. [Figure: the forward pass computes $w^T x$, then $f(w^T x)$, then the loss $L(f(w^T x))$; the backward pass propagates the gradients $\partial L / \partial f$, $\partial L / \partial w$ and $\partial L / \partial x$ back down.]

58 Loss. Softmax:
$$y_i(x) = \frac{e^{x_i}}{\sum_{k=1}^{n} e^{x_k}}, \quad i \in [n]$$
Softmax loss function: let label $j$ be the ground truth; then
$$L = -\ln(y_j(x)) = -\ln\left(\frac{e^{x_j}}{\sum_{k=1}^{n} e^{x_k}}\right) = \ln\left(\sum_{k=1}^{n} e^{x_k}\right) - x_j$$
$$\frac{\partial L}{\partial x_i} = y_i(x) - \delta_{ij}$$
where $\delta_{ij} = 1$ iff $i = j$, and $\delta_{ij} = 0$ otherwise.

59 Solver. SGD (Stochastic Gradient Descent), with $\alpha$ the learning rate and $\mu$ the momentum:
$$w_{t+1} = w_t + \Delta w_{t+1}, \quad \Delta w_{t+1} = \mu \Delta w_t - \alpha \frac{\partial L}{\partial w_t}$$

60 Solver. Solver parameters (e.g.): basic learning rate: $\alpha = 0.01$; learning rate policy: step (reduce the learning rate according to the step size); step size: [value not captured]; gamma: 0.1 (multiply the learning rate by a factor of 0.1 after every step size); momentum: $\mu = 0.9$; max iteration: [value not captured] (stop at that iteration).

61 Layer Catalogue. Please refer to this page. Vision layers: convolution, pooling. Loss layers: softmax loss, Euclidean loss, cross-entropy. Activation layers: sigmoid, ReLU, hyperbolic tangent.

62 Layer Catalogue. Data layers: database, in-memory, HDF5 input, HDF5 output. Common layers: inner product, splitting, flattening, reshape, concatenation.

63

64 Installation. Prerequisites: protobuf, CUDA, OpenBLAS, Boost, OpenCV, lmdb, leveldb, cuDNN (optional), Python (optional), numpy (optional), MATLAB (optional).
Install:
git clone git://github.com/bvlc/caffe /your/own/caffe/folder
Go to the Caffe root folder, then:
cp Makefile.config.example Makefile.config
make all
make test
make runtest
Hardware: K40, K20, or Titan for ImageNet scale; a GTX-series card or a GPU-equipped MacBook Pro for small datasets.

65 LeNet. LeNet structure: 1. Define the network with the protobuf protocol. 2. Run!

66 How to Be Professional? 1. Figure out the theoretical key points (read papers). 2. Read the Caffe source code. 3. Be proficient in programming and debugging. 4. Take advantage of search engines and the community. 5. Work through this pipeline: experiment design; data preparation (build the database with the provided tools); model selection (including network and solver); training; analysis and comparison.

67 Reference I
Ossama Abdel-Hamid, Li Deng, and Dong Yu. Exploring convolutional neural network structures and optimization techniques for speech recognition. In INTERSPEECH, 2013.
Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
Ross Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, 2015.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint, 2015.

68 Reference II
Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
Itay Hubara, Daniel Soudry, and Ran El-Yaniv. Binarized neural networks. arXiv preprint, 2016.
Herbert Jaeger. Tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the echo state network approach. GMD-Forschungszentrum Informationstechnik, 2002.

69 Reference III
Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM International Conference on Multimedia. ACM, 2014.
Abdel-rahman Mohamed, George E. Dahl, and Geoffrey Hinton. Acoustic modeling using deep belief networks. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):14–22, 2012.

70 Reference IV
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

71 Reference V
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision – ECCV 2014, pages 818–833. Springer, 2014.
