Compacting ConvNets for End-to-End Learning. Jose M. Alvarez. Joint work with Lars Petersson, Hao Zhou, Fatih Porikli.
Success of CNNs: Image Classification. Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. NIPS 2012.
Success of CNNs: Object Detection. Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. arXiv:1506.01497.
Success of CNNs: Semantic Segmentation. Jifeng Dai, Kaiming He, Jian Sun. BoxSup: Exploiting Bounding Boxes to Supervise Convolutional Networks for Semantic Segmentation. arXiv:1503.01640.
Success of CNNs: Image Captioning. Andrej Karpathy, Li Fei-Fei. Deep Visual-Semantic Alignments for Generating Image Descriptions. CVPR 2015. Also: video classification.
Keys to success: Better training algorithms (batch normalization, initializations, momentum).
Keys to success: Better training algorithms. Large amounts of data / labels.
Keys to success: Better training algorithms. Large amounts of data / labels. Hardware / storage: GPUs, parallel systems. [Chart: GPU memory in GB for GTX-580, Titan Black ('14), Titan X ('15).]
Keys to success: Better training algorithms. Large amounts of data / labels. Hardware / storage. A larger community of researchers.
Keys to success: Together, these enabled larger networks. [Chart: number of parameters in millions for LeNet-5, AlexNet, VGGNet-16.]
Challenges: Embedded devices with limited resources / power. 2014: Jetson TK1. 2015/16: Jetson TX1.
Challenges: Embedded devices with limited resources / power. Memory is a limiting factor. Real-time operation is required.
Computational Cost: AlexNet. The forward pass is time-consuming.
Computational Cost: AlexNet. Memory bottleneck.
Computational Cost: VGGNet. Memory bottleneck. Parameters per block:
conv3-64  x 2 :      38,720
conv3-128 x 2 :     221,440
conv3-256 x 3 :   1,475,328
conv3-512 x 3 :   5,899,776
conv3-512 x 3 :   7,079,424
fc1           : 102,764,544
fc2           :  16,781,312
fc3           :   4,097,000
TOTAL         : 138,357,544
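The breakdown above follows directly from the layer shapes. Below is a minimal Python sketch that reproduces it, assuming the standard VGG-16 configuration (3x3 kernels with biases, a 7x7x512 feature map entering fc1, 1000 output classes): a conv layer has k*k*C_in*C_out + C_out parameters and a fully connected layer has N_in*N_out + N_out.

```python
# Sketch: reproduce the VGG-16 per-block parameter counts shown above.
def conv_params(k, c_in, c_out):
    return k * k * c_in * c_out + c_out      # weights + biases

def fc_params(n_in, n_out):
    return n_in * n_out + n_out

blocks = [
    ("conv3-64  x 2", [conv_params(3, 3, 64), conv_params(3, 64, 64)]),
    ("conv3-128 x 2", [conv_params(3, 64, 128), conv_params(3, 128, 128)]),
    ("conv3-256 x 3", [conv_params(3, 128, 256)] + [conv_params(3, 256, 256)] * 2),
    ("conv3-512 x 3", [conv_params(3, 256, 512)] + [conv_params(3, 512, 512)] * 2),
    ("conv3-512 x 3", [conv_params(3, 512, 512)] * 3),
    ("fc1", [fc_params(7 * 7 * 512, 4096)]),  # flattened 7x7x512 feature map
    ("fc2", [fc_params(4096, 4096)]),
    ("fc3", [fc_params(4096, 1000)]),         # 1000 ImageNet classes
]

total = 0
for name, params in blocks:
    total += sum(params)
    print(f"{name:15s}: {sum(params):>11,}")
print(f"{'TOTAL':15s}: {total:>11,}")
```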
Do we need all these parameters?
Over-Parameterization: Needed for highly non-convex optimization [1]. [1] Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, Yann LeCun. The Loss Surfaces of Multilayer Networks.
Over-Parameterization: Needed for highly non-convex optimization. Deeper structures have larger learning capacity [1]. [1] Guido Montúfar, Razvan Pascanu, Kyunghyun Cho, Yoshua Bengio. On the Number of Linear Regions of Deep Neural Networks. NIPS 2014.
Over-Parameterization: Needed for highly non-convex optimization. Deeper structures, larger learning capacity. From images to video -> even larger nets? A. Karpathy et al. Large-scale Video Classification with Convolutional Neural Networks. CVPR 2014.
Compacting CNNs
Compacting CNNs: Network distillation. Network pruning. Structured parameters. Ours.
Compacting CNNs: Network distillation.
Compacting CNNs: Network distillation. A large network learns from the data. Generate labels using the trained network. Train smaller nets on its output logits or softened softmax outputs (soft targets). Geoffrey Hinton, Oriol Vinyals, Jeff Dean. Distilling the Knowledge in a Neural Network. NIPS Workshop 2015.
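As an illustration of the soft-target idea, here is a minimal PyTorch sketch of a distillation loss in the spirit of Hinton et al.; the temperature T and mixing weight alpha are illustrative values, not numbers from the talk.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Soft-target distillation: KL divergence between temperature-softened
    teacher and student distributions, mixed with the usual cross-entropy on
    hard labels. T and alpha are illustrative hyper-parameters."""
    soft_teacher = F.softmax(teacher_logits / T, dim=1)
    log_soft_student = F.log_softmax(student_logits / T, dim=1)
    # The T^2 factor keeps soft- and hard-target gradients on a comparable scale.
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce
```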
Compacting CNNs: Network distillation (II). Use intermediate layers to guide the training; see the sketch below. Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, Yoshua Bengio. FitNets: Hints for Thin Deep Nets. ICLR 2015.
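A hedged sketch of the hint idea, assuming PyTorch: the student's intermediate feature map is mapped through a small learned regressor and pulled towards the teacher's intermediate feature map with an L2 loss. The channel sizes are placeholders, not values from FitNets or the talk.

```python
import torch.nn as nn

class HintLoss(nn.Module):
    """FitNets-style hint: L2 penalty between a teacher's intermediate feature
    map and the student's, after a learned 1x1-conv regressor matches channels."""
    def __init__(self, student_channels=64, teacher_channels=256):
        super().__init__()
        self.regressor = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, student_feat, teacher_feat):
        return nn.functional.mse_loss(self.regressor(student_feat), teacher_feat)
```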
Compacting CNNs: Network distillation. Pros: in general, better generalization and a faster, smaller net; equal or slightly better performance. Cons: requires a larger network to learn from.
Compacting CNNs: Network pruning. Directly remove unimportant parameters during training (some criteria require second derivatives). Removing parameters + quantization gives good compression rates, orthogonal to other approaches [1]. [1] S. Han, H. Mao, W. J. Dally. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. arXiv:1510.00149, 2015.
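For intuition, a simplified magnitude-pruning step in the spirit of Han et al. (PyTorch): the real Deep Compression pipeline adds iterative retraining, weight quantization, and Huffman coding, and the 90% sparsity level here is only illustrative.

```python
import torch

def magnitude_prune(weight, sparsity=0.9):
    """Zero out the smallest-magnitude weights and return a mask so they can
    be kept at zero while retraining. `sparsity` is the fraction removed."""
    k = max(1, int(sparsity * weight.numel()))
    threshold = weight.abs().flatten().kthvalue(k).values
    mask = (weight.abs() > threshold).float()
    return weight * mask, mask

# Usage: after each optimizer step, re-apply the mask so pruned weights stay zero:
#   w.data, mask = magnitude_prune(w.data); ...; w.data.mul_(mask)
```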
Compacting CNNs: Network distillation. Network pruning. Structured parameters.
Compacting CNNs: Structured parameters. Low-rank approximations. Max Jaderberg, Andrea Vedaldi, Andrew Zisserman. Speeding up Convolutional Neural Networks with Low Rank Expansions. BMVC 2014.
Compacting CNNs: Structured parameters. Low-rank approximations (II). Emily Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, Rob Fergus. Exploiting Linear Structure Within Convolutional Networks for Efficient Evaluation. NIPS 2014.
Compacting CNNs: Structured parameters. Low-rank approximations (III). Weights are approximated by a sum of rank-1 tensors. Emily Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, Rob Fergus. Exploiting Linear Structure Within Convolutional Networks for Efficient Evaluation. NIPS 2014.
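A toy numpy sketch of the rank-truncation idea: reshape the 4D filter bank into a matrix and keep the leading SVD terms, i.e. a sum of rank-1 components. Denton et al. use richer tensor decompositions; the filter-bank shape and rank below are illustrative.

```python
import numpy as np

def low_rank_filters(W, rank):
    """Approximate a conv weight tensor (C_out, C_in, k, k) by reshaping it
    into a matrix and truncating its SVD to `rank` rank-1 terms."""
    c_out, c_in, kh, kw = W.shape
    M = W.reshape(c_out, c_in * kh * kw)
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    M_approx = (U[:, :rank] * s[:rank]) @ Vt[:rank]   # sum of `rank` rank-1 matrices
    return M_approx.reshape(c_out, c_in, kh, kw)

W = np.random.randn(64, 3, 7, 7)                      # illustrative first-layer filter bank
W_r = low_rank_filters(W, rank=8)
print(np.linalg.norm(W - W_r) / np.linalg.norm(W))    # relative approximation error
```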
Compacting CNNs: Structured parameters. Weak points: needs a fully trained full-rank network; not all filters can be approximated; theoretical speed-ups come with a drop in performance. Emily Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, Rob Fergus. Exploiting Linear Structure Within Convolutional Networks for Efficient Evaluation. NIPS 2014.
Compacting CNNs: Structured parameters. Weak points: needs a fully trained full-rank network; not all filters can be approximated; drop in performance. Strengths: potential to aid regularization during or after training; parameter sharing within the layer.
Compacting CNNs: Structured parameters. Low-rank approximations (IV). VGG nets restrict filters during training: stacking 3x3 convolutions keeps the same receptive field as a larger filter, gives deeper networks (more nonlinearities), and uses fewer parameters (3x(3x3)C^2 = 27C^2 vs. 49C^2 for a single 7x7 layer). K. Simonyan, A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR 2015.
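The parameter comparison in that bullet is a one-line calculation; the snippet below works it out for C = 512 channels (an illustrative choice).

```python
# One 7x7 conv layer vs. a stack of three 3x3 layers, both with C input and
# C output channels (biases ignored for clarity).
C = 512
single_7x7 = 7 * 7 * C * C            # 49 C^2  -> 12,845,056
stack_3x3 = 3 * (3 * 3 * C * C)       # 27 C^2  ->  7,077,888
print(single_7x7, stack_3x3, stack_3x3 / single_7x7)   # ~0.55x the parameters
```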
Compacting CNNs: Structured parameters. Low-rank approximations (Ours [1]). Filter restriction during training: larger receptive fields, deeper networks (more nonlinearities), parameter sharing, fewer parameters. [1] Joint work with Lars Petersson. Under review.
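Since the method itself is under review, the snippet below is only a generic illustration of restricting filters during training, not the actual approach: a separable layer that composes a 1xk and a kx1 convolution, giving a kxk receptive field with parameter sharing and roughly 2k*C^2 instead of k^2*C^2 weights (PyTorch, illustrative sizes).

```python
import torch.nn as nn

class SeparableConv(nn.Module):
    """Generic low-rank filter restriction (not the method under review):
    a 1xk followed by a kx1 convolution spans a kxk receptive field."""
    def __init__(self, channels, k=7):
        super().__init__()
        self.horizontal = nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2))
        self.vertical = nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.vertical(self.relu(self.horizontal(x))))
```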
Compacting CNNs: Structured parameters. Low-rank approximations (Ours): ImageNet results (AlexNet). Baseline: Alex Krizhevsky, Ilya Sutskever, Geoffrey Hinton. ImageNet Classification with Deep Convolutional Neural Networks. NIPS 2012.
Compacting CNNs: Structured parameters. Low-rank approximations (Ours): Stereo matching. [Results for Ours-3 (32K), Ours-1 (32K), Ours-1 (48K).] Baseline: Jure Zbontar, Yann LeCun. Computing the Stereo Matching Cost with a Convolutional Neural Network. CVPR 2015.
Memory?
Computational Cost: VGGNet. Memory bottleneck (recap of the parameter breakdown above: the three fully connected layers account for roughly 124M of the 138M parameters).
Computational Cost: AlexNet. Memory bottleneck.
Memory Bottleneck: Sparse constraints during training (Ours [2]). Directly reduce the number of neurons. Select the optimal number of neurons. Significant memory reductions with a minor drop in performance. [2] Joint work with Hao Zhou, Fatih Porikli. Under review.
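The exact formulation is under review, so the snippet below is only a generic group-sparsity sketch of the idea of removing whole neurons during training: an L2,1 (group-Lasso) penalty where each group collects all weights feeding one output neuron or filter (PyTorch; the regularization strength is illustrative).

```python
import torch

def group_sparsity_penalty(weight, reg=1e-4):
    """Group-Lasso penalty: one group per output neuron/filter, so entire
    neurons are driven to zero and can be removed after training.
    weight: (C_out, C_in, k, k) for conv or (N_out, N_in) for fully connected."""
    per_neuron_norms = weight.flatten(1).norm(p=2, dim=1)
    return reg * per_neuron_norms.sum()

# Usage: loss = task_loss + sum(group_sparsity_penalty(m.weight)
#                               for m in model.modules()
#                               if isinstance(m, (torch.nn.Conv2d, torch.nn.Linear)))
```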
Memory Bottleneck: Sparse constraints during training (Ours [2]). [2] Joint work with Hao Zhou, Fatih Porikli. Under review.
Do we need all these parameters?
Compacting ConvNets for End-to-End Learning. Jose M. Alvarez. Joint work with Lars Petersson, Hao Zhou, Fatih Porikli.