Deep Residual Networks

Transcription

1 Deep Residual Networks: Deep Learning Gets Way Deeper. ICML 2016 tutorial, 8:30-10:30am, June 19. Kaiming He, Facebook AI Research* (*as of July; formerly affiliated with Microsoft Research Asia). [Figure: ResNet layer diagram, from 7x7 conv, 64, /2 through ave pool, fc 1000.]

2 Overview: Introduction; Background (from shallow to deep); Deep Residual Networks (from 10 layers to 100 layers; from 100 layers to 1000 layers); Applications; Q & A.

3 Introduction

4 Introduction Deep Residual Networks (ResNets) Deep Residual Learning for Image Recognition. CVPR 2016 (next week) A simple and clean framework of training very deep nets State-of-the-art performance for Image classification Object detection Semantic segmentation and more Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. Deep Residual Learning for Image Recognition. CVPR 2016.

5 ILSVRC & COCO 2015 Competitions: 1st places in all five main tracks. ImageNet Classification: "ultra-deep" 152-layer nets; ImageNet Detection: 16% better than 2nd; ImageNet Localization: 27% better than 2nd; COCO Detection: 11% better than 2nd; COCO Segmentation: 12% better than 2nd. *Improvements are relative numbers.

6 Revolution of Depth. [Figure: ImageNet Classification top-5 error (%) for ILSVRC'10 through ILSVRC'15, with entries labeled shallow, AlexNet (ILSVRC'12, 8 layers), VGG (ILSVRC'14, 19 layers), GoogleNet (ILSVRC'14), and ResNet (ILSVRC'15, 152 layers); the error values are not preserved in this transcript.]

7 Revolution of Depth. AlexNet, 8 layers (ILSVRC 2012): 11x11 conv, 96, /4, pool/2; 5x5 conv, 256, pool/2; 3x3 conv, 384; 3x3 conv, 384, pool/2; fc, 4096; fc, 4096; fc, 1000.

8 Revolution of Depth. [Figure: layer diagrams of AlexNet, 8 layers (ILSVRC 2012); VGG, 19 layers (ILSVRC 2014); and GoogleNet, 22 layers (ILSVRC 2014).]

9 Revolution of Depth. [Figure: layer diagrams of AlexNet, 8 layers (ILSVRC 2012); VGG, 19 layers (ILSVRC 2014); and ResNet, 152 layers (ILSVRC 2015).]

10 Revolution of Depth: engines of visual recognition. [Figure: PASCAL VOC 2007 Object Detection mAP (%) for HOG, DPM (shallow); AlexNet (RCNN), 8 layers; VGG (RCNN), 16 layers; and ResNet (Faster RCNN)*, 101 layers; the mAP values are not preserved in this transcript.] *w/ other improvements & more data.

11 ResNet's object detection result on COCO. *The original image is from the COCO dataset.

12 Very simple, easy to follow. Many third-party implementations (list in [link not preserved]): Facebook AI Research's Torch ResNet; Torch, CIFAR-10, with ResNet-20 to ResNet-110, training code, and curves: code; Lasagne, CIFAR-10, with ResNet-32 and ResNet-56 and training code: code; Neon, CIFAR-10, with pre-trained ResNet-32 to ResNet-110 models, training code, and curves: code; Torch, MNIST, 100 layers: blog, code; a winning entry in Kaggle's right whale recognition challenge: blog, code; Neon, Place2 (mini), 40 layers: blog, code. Easily reproduced results (e.g. Torch ResNet). A series of extensions and follow-ups: > 200 citations in 6 months after posted on arXiv (Dec. 2015).

13 Background From shallow to deep

14 Traditional recognition, from shallower to deeper pipelines: pixels -> classifier -> "bus?"; edges -> classifier -> "bus?"; edges -> histogram (SIFT/HOG) -> classifier -> "bus?"; edges -> K-means/sparse code -> histogram -> classifier -> "bus?". But what's next?

15 Deep Learning. Specialized components, domain knowledge required: edges -> K-means/sparse code -> histogram -> classifier -> "bus?". Generic components ("layers"), less domain knowledge. Repeat elementary layers => going deeper. End-to-end learning; richer solution space.

16 Spectrum of Depth, from shallower to deeper. 5 layers: easy; >10 layers: initialization, Batch Normalization; >30 layers: skip connections; >100 layers: identity skip connections; >1000 layers: ?

17 Initialization. Input $X$, weight $W$, output $Y = WX$. If the activation is linear and $x$, $y$, $w$ are independent, then: 1-layer: $\mathrm{Var}[y] = (n^{in}\,\mathrm{Var}[w])\,\mathrm{Var}[x]$; multi-layer: $\mathrm{Var}[y] = \left(\prod_d n_d^{in}\,\mathrm{Var}[w_d]\right)\mathrm{Var}[x]$. LeCun et al 1998, Efficient Backprop. Glorot & Bengio 2010, Understanding the difficulty of training deep feedforward neural networks.

18 Initialization. Both forward (response) and backward (gradient) signal can vanish/explode. Forward: $\mathrm{Var}[y] = \left(\prod_d n_d^{in}\,\mathrm{Var}[w_d]\right)\mathrm{Var}[x]$. Backward: $\mathrm{Var}\left[\frac{\partial E}{\partial x}\right] = \left(\prod_d n_d^{out}\,\mathrm{Var}[w_d]\right)\mathrm{Var}\left[\frac{\partial E}{\partial y}\right]$. [Figure: signal magnitude vs. depth, with exploding, ideal, and vanishing regimes.]

19 Initialization. Initialization under linear assumption: $\prod_d n_d^{in}\,\mathrm{Var}[w_d] = const_{fw}$ (healthy forward) and $\prod_d n_d^{out}\,\mathrm{Var}[w_d] = const_{bw}$ (healthy backward). This holds with $n_d^{in}\,\mathrm{Var}[w_d] = 1$ or* $n_d^{out}\,\mathrm{Var}[w_d] = 1$ ("Xavier" init in Caffe). *: $n_d^{out} = n_{d+1}^{in}$, so $\frac{const_{bw}}{const_{fw}} = \frac{n_{last}^{out}}{n_{first}^{in}} < \infty$. It is sufficient to use either form.
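The variance rule on the initialization slides can be checked with plain arithmetic. The sketch below is illustrative only: it assumes linear activations and an identical fan-in for every layer, and the depth, fan-in, and function name are hypothetical choices that do not come from the slides.

```python
def forward_variance_gain(fan_ins, weight_vars):
    """Product over layers of n_d^in * Var[w_d] (linear activations assumed)."""
    gain = 1.0
    for n_in, var_w in zip(fan_ins, weight_vars):
        gain *= n_in * var_w
    return gain


DEPTH = 30    # hypothetical depth
FAN_IN = 256  # hypothetical fan-in, identical for every layer

# n * Var[w] = 1 per layer: the signal variance is preserved.
healthy = forward_variance_gain([FAN_IN] * DEPTH, [1.0 / FAN_IN] * DEPTH)
# n * Var[w] = 0.5 per layer: the variance shrinks as 0.5 ** DEPTH.
vanishing = forward_variance_gain([FAN_IN] * DEPTH, [0.5 / FAN_IN] * DEPTH)
# n * Var[w] = 2 per layer: the variance grows as 2 ** DEPTH.
exploding = forward_variance_gain([FAN_IN] * DEPTH, [2.0 / FAN_IN] * DEPTH)

print(healthy, vanishing, exploding)
```

A per-layer factor that is only a constant away from 1 already changes the output variance by about nine orders of magnitude at 30 layers, which is why the per-layer condition matters more as nets get deeper.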

20 Initialization. Initialization under ReLU activation: $\prod_d \frac{1}{2} n_d^{in}\,\mathrm{Var}[w_d] = const_{fw}$ (healthy forward) and $\prod_d \frac{1}{2} n_d^{out}\,\mathrm{Var}[w_d] = const_{bw}$ (healthy backward). This holds with $\frac{1}{2} n_d^{in}\,\mathrm{Var}[w_d] = 1$ or $\frac{1}{2} n_d^{out}\,\mathrm{Var}[w_d] = 1$ ("MSRA" init in Caffe). With $D$ layers, a factor of 2 per layer has exponential impact of $2^D$. Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. ICCV 2015.
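The exponential impact of the factor of 2 can be seen in a small simulation. This is a hedged sketch, not the experiment from the paper: it assumes a fully connected ReLU net of equal width in every layer, and the width, depth, seed, and function name are hypothetical. Both runs reuse the same random draws, so the two weight sets differ only by a factor of sqrt(2) per layer; because ReLU is positively homogeneous, the final second moments then differ by 2 ** DEPTH.

```python
import math
import random


def relu_net_second_moment(width, depth, var_scale, seed):
    """Mean square of the last pre-activation of a random ReLU net.

    Weights are drawn from N(0, var_scale / width), so var_scale is the
    per-layer value of n * Var[w].
    """
    rng = random.Random(seed)
    std = math.sqrt(var_scale / width)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    for _ in range(depth):
        # One fully connected layer followed by ReLU.
        y = [sum(rng.gauss(0.0, std) * xi for xi in x) for _ in range(width)]
        x = [max(0.0, yi) for yi in y]
    return sum(yi * yi for yi in y) / width


WIDTH, DEPTH, SEED = 64, 10, 0  # hypothetical sizes

xavier_like = relu_net_second_moment(WIDTH, DEPTH, 1.0, SEED)  # n Var[w] = 1
msra_like = relu_net_second_moment(WIDTH, DEPTH, 2.0, SEED)    # n Var[w] / 2 = 1

ratio = msra_like / xavier_like
print(ratio, 2 ** DEPTH)
```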

21 Initialization. 22-layer net: good init converges faster. 30-layer net: good init is able to converge. [Figures: training curves comparing ours, $\frac{1}{2} n\,\mathrm{Var}[w] = 1$, with Xavier, $n\,\mathrm{Var}[w] = 1$.] *Figures show the beginning of training.

22 Batch Normalization (BN). Normalizing input (LeCun et al 1998, Efficient Backprop). BN: normalizing each layer, for each mini-batch. Greatly accelerates training; less sensitive to initialization; improves regularization. S. Ioffe & C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML 2015.

23 Batch Normalization (BN). For a layer input $x$: $\hat{x} = \frac{x - \mu}{\sigma}$, $y = \gamma\hat{x} + \beta$. $\mu$: mean of $x$ in mini-batch; $\sigma$: std of $x$ in mini-batch; $\gamma$: scale; $\beta$: shift. $\mu$, $\sigma$: functions of $x$, analogous to responses. $\gamma$, $\beta$: parameters to be learned, analogous to weights.

24 Batch Normalization (BN). For a layer input $x$: $\hat{x} = \frac{x - \mu}{\sigma}$, $y = \gamma\hat{x} + \beta$. 2 modes of BN. Train mode: $\mu$, $\sigma$ are functions of $x$; backprop gradients. Test mode: $\mu$, $\sigma$ are pre-computed* on training set. Caution: make sure your BN is in a correct mode. *: by running average, or post-processing after training.
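The two BN modes can be sketched for a single scalar feature as follows. This is a minimal illustration under stated assumptions rather than any framework's implementation: the class name, the `momentum` running-average rule, and the small `eps` added under the square root are hypothetical choices not given on the slide, and the backward pass is omitted.

```python
import math


class ScalarBatchNorm:
    """Batch normalization of one scalar feature (forward pass only)."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.gamma = 1.0          # learned scale
        self.beta = 0.0           # learned shift
        self.running_mean = 0.0   # statistics used in test mode
        self.running_var = 1.0
        self.momentum = momentum  # assumed running-average rate
        self.eps = eps            # assumed constant for numerical stability

    def forward(self, batch, training):
        if training:
            # Train mode: mu and sigma are functions of the mini-batch.
            mean = sum(batch) / len(batch)
            var = sum((v - mean) ** 2 for v in batch) / len(batch)
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            # Test mode: mu and sigma are pre-computed, here by running average.
            mean, var = self.running_mean, self.running_var
        scale = self.gamma / math.sqrt(var + self.eps)
        return [scale * (v - mean) + self.beta for v in batch]


bn = ScalarBatchNorm()
train_out = bn.forward([1.0, 2.0, 3.0, 4.0], training=True)
test_out = bn.forward([5.0], training=False)
print(train_out, test_out)
```

Calling the layer in train mode on a single test example would normalize that example against itself, which is the failure the slide's caution is about.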

25 Batch Normalization (BN). [Figure: accuracy vs. iter., comparing w/ BN and w/o BN; figure taken from S. Ioffe & C. Szegedy.]

26 Deep Residual Networks From 10 layers to 100 layers

27 Going Deeper. Initialization algorithms; Batch Normalization. Is learning better networks as simple as stacking more layers?

28 Simply stacking layers? [Figures: CIFAR train error (%) and test error (%) vs. iter. (1e4) for a 20-layer and a 56-layer plain net.] Plain nets: stacking 3x3 conv layers. The 56-layer net has higher training error and test error than the 20-layer net.

29 Simply stacking layers? [Figures: error (%) vs. iter. (1e4) for plain nets on CIFAR (plain-20, plain-32, and deeper plain nets) and on ImageNet; solid: test/val, dashed: train.] Overly deep plain nets have higher training error. A general phenomenon, observed in many datasets.

30 A shallower model (18 layers) vs. a deeper counterpart (34 layers) with extra layers. Richer solution space: a deeper model should not have higher training error. A solution by construction: original layers copied from a learned shallower model; extra layers set as identity; at least the same training error. Optimization difficulties: solvers cannot find the solution when going deeper. [Figure: the two layer diagrams, from 7x7 conv, 64, /2 through fc 1000.]

31 Deep Residual Learning. Plain net: $H(x)$ is any desired mapping; hope the 2 weight layers fit $H(x)$. [Diagram, any two stacked layers: x -> weight layer -> relu -> weight layer -> relu -> H(x).]

32 Deep Residual Learning. Residual net: $H(x)$ is any desired mapping; instead of hoping the 2 weight layers fit $H(x)$, hope the 2 weight layers fit $F(x)$, and let $H(x) = F(x) + x$. [Diagram: x -> weight layer -> relu -> weight layer gives F(x); an identity shortcut carries x to the addition; H(x) = F(x) + x, followed by relu.]

33 Deep Residual Learning. $F(x)$ is a residual mapping w.r.t. identity, with $H(x) = F(x) + x$. If identity were optimal, easy to set weights as 0. If optimal mapping is closer to identity, easier to find small fluctuations. [Diagram: x -> weight layer -> relu -> weight layer gives F(x); identity shortcut x; addition, then relu.]
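The remark that identity is easy to reach by setting weights to 0 can be illustrated with a toy block. The sketch below assumes fully connected weight layers on small vectors instead of convolutions, and all function names are hypothetical; it follows the slide's layout of two weight layers with a relu in between, an identity shortcut, and a relu after the addition.

```python
def relu(v):
    return [max(0.0, a) for a in v]


def linear(weights, v):
    """Matrix-vector product; weights is a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) for row in weights]


def plain_block(w1, w2, x):
    """Two stacked weight layers: the block output is H(x) directly."""
    return relu(linear(w2, relu(linear(w1, x))))


def residual_block(w1, w2, x):
    """Same two layers fit F(x); the block output is relu(F(x) + x)."""
    fx = linear(w2, relu(linear(w1, x)))
    return relu([f + a for f, a in zip(fx, x)])


zeros = [[0.0] * 3 for _ in range(3)]
x = [0.5, 2.0, 0.0]  # non-negative, as after a preceding relu

print(plain_block(zeros, zeros, x))     # all-zero weights wipe out the input
print(residual_block(zeros, zeros, x))  # all-zero weights pass the input through
```

With all-zero weights the plain block loses its input while the residual block reproduces it, so the identity solution sits at the origin of the residual block's weight space.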

34 Related Works: Residual Representations. VLAD & Fisher Vector [Jegou et al 2010], [Perronnin et al 2007]: encoding residual vectors; powerful shallower representations. Product Quantization (IVF-ADC) [Jegou et al 2011]: quantizing residual vectors; efficient nearest-neighbor search. MultiGrid & Hierarchical Precondition [Briggs, et al 2000], [Szeliski 1990, 2006]: solving residual sub-problems; efficient PDE solvers.

35 Network Design. Keep it simple. Our basic design (VGG-style): all 3x3 conv (almost); spatial size /2 => # filters x2 (~same complexity per layer). Simple design; just deep! Other remarks: no hidden fc; no dropout. [Figure: plain net and ResNet layer diagrams, from 7x7 conv, 64, /2 and pool, /2 through avg pool and fc 1000.]

36 Training. All plain/residual nets are trained from scratch. All plain/residual nets use Batch Normalization. Standard hyper-parameters & augmentation.

37 CIFAR-10 experiments. [Figures: error (%) vs. iter. (1e4) for CIFAR-10 plain nets (20-, 32-, 44-layer and deeper) and CIFAR-10 ResNets (20-, 32-, 44-, 56-, and 110-layer); solid: test, dashed: train.] Deep ResNets can be trained without difficulties. Deeper ResNets have lower training error, and also lower test error.

38 ImageNet experiments. [Figures: error (%) vs. iter. (1e4) for ImageNet plain nets (plain-18 and a deeper plain net) and ImageNet ResNets (ResNet-18 and a deeper ResNet); solid: test, dashed: train.] Deep ResNets can be trained without difficulties. Deeper ResNets have lower training error, and also lower test error.

39 ImageNet experiments. A practical design of going deeper: an all-3x3 block on a 64-d input (3x3, 64; relu; 3x3, 64) and a bottleneck block on a 256-d input (1x1, 64; relu; 3x3, 64; relu; 1x1, 256), each followed by a relu after the addition, have similar complexity. The bottleneck is used for ResNet-50/101/152.
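The "similar complexity" remark can be checked by counting convolution weights in the two blocks. This is a rough sketch that counts only kernel weights, ignoring biases, Batch Normalization parameters, and the fact that both blocks would run at the same spatial size; the helper name is hypothetical.

```python
def conv_params(kernel, in_channels, out_channels):
    """Number of weights in a kernel x kernel convolution, ignoring biases."""
    return kernel * kernel * in_channels * out_channels


# All-3x3 block on a 64-d input: 3x3, 64 followed by 3x3, 64.
all_3x3 = conv_params(3, 64, 64) + conv_params(3, 64, 64)

# Bottleneck block on a 256-d input: 1x1, 64 then 3x3, 64 then 1x1, 256.
bottleneck = (conv_params(1, 256, 64)
              + conv_params(3, 64, 64)
              + conv_params(1, 64, 256))

print(all_3x3, bottleneck)  # 73728 69632
```

The bottleneck handles a 4x wider input and output with slightly fewer weights, because the expensive 3x3 convolution only ever sees 64 channels.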

40 ImageNet experiments. Deeper ResNets have lower error. [Figure: crop testing, top-5 val error (%) for ResNet-152, ResNet-101, ResNet-50, and one shallower ResNet, with the annotation "this model has lower time complexity than VGG-16/19"; the error values and the model the annotation points to are not preserved in this transcript.]

41 ImageNet experiments. [Figure: ImageNet Classification top-5 error (%) for ILSVRC'10 through ILSVRC'15, with entries labeled shallow, AlexNet (ILSVRC'12, 8 layers), VGG (ILSVRC'14, 19 layers), GoogleNet (ILSVRC'14), and ResNet (ILSVRC'15, 152 layers); the error values are not preserved in this transcript.]

42 Discussions Representation, Optimization, Generalization

43 Issues on learning deep models. Representation ability: ability of model to fit training data, if optimum could be found; if model A's solution space is a superset of B's, A should be better. Optimization ability: feasibility of finding an optimum; not all models are equally easy to optimize. Generalization ability: once training data is fit, how good is the test performance.

44 How do ResNets address these issues? Representation ability: no explicit advantage on representation (only re-parameterization), but allow models to go deeper. Optimization ability: enable very smooth forward/backward prop; greatly ease optimizing deeper models. Generalization ability: not explicitly address generalization, but deeper+thinner is good generalization.

45 On the Importance of Identity Mapping From 100 layers to 1000 layers

46 On identity mappings for optimization. A residual unit computes $x_{l+1} = f(h(x_l) + F(x_l))$. Shortcut mapping: $h$ = identity. After-add mapping: $f$ = ReLU. Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. Identity Mappings in Deep Residual Networks. arXiv 2016.

47 On identity mappings for optimization. A residual unit computes $x_{l+1} = f(h(x_l) + F(x_l))$. Shortcut mapping: $h$ = identity. After-add mapping: $f$ = ReLU. What if $f$ = identity?

49 Very smooth forward propagation. $x_{l+1} = x_l + F(x_l)$; $x_{l+2} = x_{l+1} + F(x_{l+1})$.

50 Very smooth forward propagation. $x_{l+1} = x_l + F(x_l)$; $x_{l+2} = x_{l+1} + F(x_{l+1})$, so $x_{l+2} = x_l + F(x_l) + F(x_{l+1})$.

51 Very smooth forward propagation. $x_{l+1} = x_l + F(x_l)$; $x_{l+2} = x_{l+1} + F(x_{l+1})$, so $x_{l+2} = x_l + F(x_l) + F(x_{l+1})$. In general: $x_L = x_l + \sum_{i=l}^{L-1} F(x_i)$.

52 Very smooth forward propagation. $x_L = x_l + \sum_{i=l}^{L-1} F(x_i)$. Any $x_l$ is directly forward-prop to any $x_L$, plus residual. Any $x_L$ is an additive outcome, in contrast to multiplicative: $x_L = \prod_{i=l}^{L-1} W_i\, x_l$. [Figure: plain net and ResNet layer diagrams with $x_l$ and $x_L$ marked.]

53 Very smooth backward propagation. From $x_L = x_l + \sum_{i=l}^{L-1} F(x_i)$: $\frac{\partial E}{\partial x_l} = \frac{\partial E}{\partial x_L}\frac{\partial x_L}{\partial x_l} = \frac{\partial E}{\partial x_L}\left(1 + \frac{\partial}{\partial x_l}\sum_{i=l}^{L-1} F(x_i)\right)$. [Figure: plain net and ResNet layer diagrams with $\frac{\partial E}{\partial x_l}$ and $\frac{\partial E}{\partial x_L}$ marked.]

54 Very smooth backward propagation. $\frac{\partial E}{\partial x_l} = \frac{\partial E}{\partial x_L}\left(1 + \frac{\partial}{\partial x_l}\sum_{i=l}^{L-1} F(x_i)\right)$. Any $\frac{\partial E}{\partial x_L}$ is directly back-prop to any $\frac{\partial E}{\partial x_l}$, plus residual. Any $\frac{\partial E}{\partial x_l}$ is additive; unlikely to vanish, in contrast to multiplicative: $\frac{\partial E}{\partial x_l} = \prod_{i=l}^{L-1} W_i\, \frac{\partial E}{\partial x_L}$. [Figure: plain net and ResNet layer diagrams with $\frac{\partial E}{\partial x_l}$ and $\frac{\partial E}{\partial x_L}$ marked.]

55 Residual for every layer. Forward: $x_L = x_l + \sum_{i=l}^{L-1} F(x_i)$. Backward: $\frac{\partial E}{\partial x_l} = \frac{\partial E}{\partial x_L}\left(1 + \frac{\partial}{\partial x_l}\sum_{i=l}^{L-1} F(x_i)\right)$. Enabled by: shortcut mapping $h$ = identity; after-add mapping $f$ = identity.
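The additive forward form and the "1 + ..." backward form can be verified numerically on a toy chain. The sketch below assumes scalar features and a hypothetical residual function F_i(x) = w_i * tanh(x) with arbitrary weights; none of these choices come from the slides, and they stand in for the conv layers only to make the identities easy to check.

```python
import math


def residual_forward(x0, weights):
    """Return [x_0, ..., x_L] for x_{i+1} = x_i + F_i(x_i), F_i(x) = w_i * tanh(x)."""
    xs = [x0]
    for w in weights:
        xs.append(xs[-1] + w * math.tanh(xs[-1]))
    return xs


def residual_gradient(xs, weights):
    """d x_L / d x_0 by the chain rule: product of (1 + F_i'(x_i))."""
    grad = 1.0
    for x, w in zip(xs, weights):
        grad *= 1.0 + w * (1.0 - math.tanh(x) ** 2)
    return grad


WEIGHTS = [0.3, -0.2, 0.1, 0.05]  # hypothetical residual weights
X0 = 0.7                          # hypothetical input

xs = residual_forward(X0, WEIGHTS)

# Forward: x_L equals x_0 plus the sum of all residual outputs.
unrolled = xs[0] + sum(w * math.tanh(x) for x, w in zip(xs, WEIGHTS))

# Backward: the gradient is 1 (direct path) plus a residual term.
analytic = residual_gradient(xs, WEIGHTS)
step = 1e-6
numeric = (residual_forward(X0 + step, WEIGHTS)[-1]
           - residual_forward(X0 - step, WEIGHTS)[-1]) / (2 * step)

print(xs[-1], unrolled)
print(analytic, numeric, analytic - 1.0)
```

The last printed value is the residual part of the gradient; the direct-path term stays at exactly 1 however many units are stacked.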

56 Experiments. Set 1: what if shortcut mapping $h \neq$ identity? Set 2: what if after-add mapping $f$ is identity? Experiments on ResNets with more than 100 layers: deeper models suffer more from optimization difficulty.

57 Experiment Set 1: what if shortcut mapping $h \neq$ identity?

58 Shortcut variants, ResNet-110 on CIFAR-10. (a) original: $h(x) = x$, error 6.6%; (b) constant scaling: $h(x) = 0.5x$, error 12.4%; (c) exclusive gating*: $h(x) = gate \cdot x$, error 8.7%; (d) shortcut-only gating: $h(x) = gate \cdot x$, error 12.9%; (e) conv shortcut: $h(x) = conv(x)$, error 12.2%; (f) dropout shortcut: $h(x) = dropout(x)$, error > 20%. *Similar to Highway Network. [Diagrams: each unit has two 3x3 convs before the addition; the gates in (c) and (d) are a 1x1 conv followed by a sigmoid; (e) puts a 1x1 conv on the shortcut.] Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. Identity Mappings in Deep Residual Networks. arXiv 2016.

59 [Same ResNet-110 on CIFAR-10 shortcut comparison as slide 58, variants (a) through (f).] Shortcuts are blocked by multiplications.

60 If $h$ is multiplicative, e.g. $h(x) = \lambda x$*: forward: $x_L = \lambda^{L-l} x_l + \sum_{i=l}^{L-1} \hat{F}(x_i)$; backward: $\frac{\partial E}{\partial x_l} = \frac{\partial E}{\partial x_L}\left(\lambda^{L-l} + \frac{\partial}{\partial x_l}\sum_{i=l}^{L-1} \hat{F}(x_i)\right)$. If $h$ is multiplicative, shortcuts are blocked and direct propagation is decayed. *Assuming $f$ = identity.
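How quickly the direct term decays is simple arithmetic. The sketch below is illustrative: the count of 54 residual units is an assumption (it corresponds to a 110-layer CIFAR-style ResNet with two weight layers per unit), and the function name and the lambda values other than 0.5 are hypothetical.

```python
def direct_path_scale(lam, num_units):
    """Coefficient lambda ** (L - l) on the direct path through num_units shortcuts."""
    return lam ** num_units


NUM_UNITS = 54  # assumed number of residual units

for lam in (1.0, 0.9, 0.5):
    print(lam, direct_path_scale(lam, NUM_UNITS))
```

With lambda = 0.5 the direct term is scaled by 2 ** -54, about 5.6e-17, so almost all signal has to pass through the weight layers instead of the shortcut.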

61 [Figure: training curves, solid: test, dashed: train, for a unit where h is gating (1x1 conv, sigmoid) and a unit where h is identity.] Gating should have better representation ability (identity is a special case), but optimization difficulty dominates results.

62 Experiment Set 2: what if after-add mapping f is identity?

63 [Diagrams of three residual units from $x_l$ to $x_{l+1}$, differing in where BN and the after-add mapping sit relative to the addition.] $f$ is ReLU (original ResNet); $f$ is BN+ReLU; $f$ is identity (pre-activation ResNet).

64 [Figure: training curves, solid: test, dashed: train, for $f$ = ReLU and $f$ = BN+ReLU.] BN could block prop. Keep the shortest path as smooth as possible.

65 1001-layer ResNets on CIFAR-10. [Figure: training curves, solid: test, dashed: train, for $f$ = ReLU and $f$ = identity.] ReLU could block prop when there are 1000 layers. Pre-activation design eases optimization (and improves generalization; see paper).

66 Comparisons on CIFAR-10/100 (error, %). CIFAR-10: NIN 8.81; DSN 8.22; FitNet 8.39; Highway 7.72; ResNet-110 (1.7M) 6.61; ResNet-1202 (19.4M) 7.93; ResNet-164, pre-activation (1.7M) 5.46; ResNet-1001, pre-activation (10.2M) 4.92 (4.89±0.14). CIFAR-100: methods NIN, DSN, FitNet, Highway, ResNet-164 (1.7M), ResNet-1001 (10.2M), ResNet-164, pre-activation (1.7M), ResNet-1001, pre-activation (10.2M); of the error values only (22.68±0.22) for ResNet-1001, pre-activation is preserved in this transcript. *All based on moderate augmentation.

67 ImageNet Experiments. ImageNet single-crop (320x320) val error. Columns: method, data augmentation, top-1 error (%), top-5 error (%). Rows: ResNet-152, original (scale); ResNet-152, pre-activation (scale); ResNet-200, original (scale); ResNet-200, pre-activation (scale); ResNet-200, pre-activation (scale + aspect ratio): 20.1*, 4.8*. The other values are not preserved in this transcript. *Independently reproduced by: [link not preserved]; training code and models available.

68 Summary of observations. Keep the shortest path as smooth as possible by making h and f identity: forward/backward signals directly flow through this path. Features of any layers are additive outcomes. 1000-layer ResNets can be easily trained and have better accuracy. [Diagram: residual unit from $x_l$ to $x_{l+1}$, with BN, weight, BN, weight on the residual branch before the addition.]

69 Future Works. Representation: skipping 1 layer vs. multiple layers? Flat vs. Bottleneck? Inception-ResNet [Szegedy et al 2016]; ResNet in ResNet [Targ et al 2016]; Width vs. Depth [Zagoruyko & Komodakis 2016]. Generalization: DropOut, MaxOut, DropConnect, Drop Layer (Stochastic Depth) [Huang et al 2016]. Optimization: without residual/shortcut?

70 Applications Features matter

71 "Features matter." (quote [Girshick et al. 2014], the R-CNN paper). [Table: task, 2nd-place winner, ResNets, margin (relative), for ImageNet Localization (top-5 error), ImageNet Detection, COCO Detection, and COCO Segmentation; the numeric values are not preserved in this transcript, apart from the note "absolute 8.5% better!" on ImageNet Localization.] Our results are all based on ResNet-101. Deeper features are well transferable.

72 Revolution of Depth: engines of visual recognition. [Figure: PASCAL VOC 2007 Object Detection mAP (%) for HOG, DPM (shallow); AlexNet (RCNN), 8 layers; VGG (RCNN), 16 layers; and ResNet (Faster RCNN)*, 101 layers; the mAP values are not preserved in this transcript.] *w/ other improvements & more data.

73 Deep Learning for Computer Vision. [Diagram: a classification network (backbone structure) is pre-trained on ImageNet data to provide features, then fine-tuned on target data as a detection network (e.g. R-CNN), segmentation network (e.g. FCN), human pose estimation network, depth estimation network, ...]

74 Example: Object Detection. Image Classification (what?): ✓ boat, ✓ person. Object Detection (what + where?).

75 Object Detection: R-CNN. Region-based CNN pipeline: input image -> region proposals (~2,000) -> warped region -> 1 CNN for each region -> classify regions (aeroplane? no. ... person? yes. ... tvmonitor? no.). Figure credit: R. Girshick et al. Girshick, Donahue, Darrell, Malik. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. CVPR 2014.

76 Object Detection: R-CNN. [Diagram: image -> pre-computed Regions-of-Interest (RoIs) -> one CNN per RoI -> feature; end-to-end training.]

77 Object Detection: Fast R-CNN. [Diagram: image -> CNN (shared conv layers) -> RoI pooling over pre-computed Regions-of-Interest (RoIs) -> feature per RoI; end-to-end training.] Girshick. Fast R-CNN. ICCV 2015.

78 Object Detection: Faster R-CNN. Solely based on CNN; no external modules; each step is end-to-end. [Diagram: image -> CNN -> feature map -> Region Proposal Net -> proposals -> RoI pooling -> features; end-to-end training.] Shaoqing Ren, Kaiming He, Ross Girshick, & Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NIPS 2015.

79 Object Detection. [Diagram: a classification network (backbone structure) pre-trained on ImageNet data provides features to a detection network fine-tuned on detection data.] Plug-in features: AlexNet, VGG-16, GoogleNet, ResNet-101. Independently developed detectors: R-CNN, Fast R-CNN, Faster R-CNN, MultiBox, SSD.

80 Object Detection. Simply "Faster R-CNN + ResNet". [Table: COCO detection results for the Faster R-CNN baseline, mAP@.5 and mAP@.5:.95, VGG vs. ResNet; the numeric values are not preserved in this transcript.] ResNet-101 has 28% relative gain vs VGG-16. [Diagram: image -> CNN -> feature map -> Region Proposal Net -> proposals -> RoI pooling -> classifier.]

81 Object Detection. RPN learns proposals by extremely deep nets: we use only 300 proposals (no hand-designed proposals). Add components: iterative localization; context modeling; multi-scale testing. All are based on CNN features; all are end-to-end. All benefit more from deeper features: cumulative gains!

82 ResNet's object detection result on COCO. *The original image is from the COCO dataset.

83 *the original image is from the COCO dataset

84 *the original image is from the COCO dataset

85 Results on real video. Models trained on MS COCO (80 categories); frame-by-frame, no temporal processing. This video is available online: [link not preserved].

86 More Visual Recognition Tasks. ResNet-based methods lead on these benchmarks (incomplete list): ImageNet classification, detection, localization; MS COCO detection, segmentation; PASCAL VOC detection, segmentation; human pose estimation [Newell et al 2016]; depth estimation [Laina et al 2016]; segment proposal [Pinheiro et al 2016]. [Figures: PASCAL segmentation leaderboard and PASCAL detection leaderboard, each with a ResNet-101 entry marked.]

87 Potential Applications. ResNets have shown outstanding or promising results on: Visual Recognition; Image Generation (Pixel RNN, Neural Art, etc.); Natural Language Processing (Very deep CNN); Speech Recognition (preliminary results); Advertising, user prediction (preliminary results).

88 Conclusions of the Tutorial. Deep Residual Learning: ultra deep networks can be easy to train; ultra deep networks can gain accuracy from depth; ultra deep representations are well transferable. Now 200 layers on ImageNet and 1000 layers on CIFAR!

89 Resources: Models and Code. Our ImageNet models in Caffe: [link not preserved]. Many available implementations (list in [link not preserved]): Facebook AI Research's Torch ResNet; Torch, CIFAR-10, with ResNet-20 to ResNet-110, training code, and curves: code; Lasagne, CIFAR-10, with ResNet-32 and ResNet-56 and training code: code; Neon, CIFAR-10, with pre-trained ResNet-32 to ResNet-110 models, training code, and curves: code; Torch, MNIST, 100 layers: blog, code; a winning entry in Kaggle's right whale recognition challenge: blog, code; Neon, Place2 (mini), 40 layers: blog, code; ...