Compacting ConvNets for End-to-End Learning. Jose M. Alvarez. Joint work with Lars Petersson, Hao Zhou, Fatih Porikli.
Success of CNNs: Image Classification. Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. NIPS 2012.
Success of CNNs: Object Detection. Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. arXiv:1506.01497.
Success of CNNs: Semantic Segmentation. Jifeng Dai, Kaiming He, Jian Sun. BoxSup: Exploiting Bounding Boxes to Supervise Convolutional Networks for Semantic Segmentation. arXiv:1503.01640.
Success of CNNs: Image Captioning. Andrej Karpathy, Li Fei-Fei. Deep Visual-Semantic Alignments for Generating Image Descriptions. CVPR 2015. Also: video classification.
Keys to success: Better training algorithms (batch normalization, initializations, momentum).
Keys to success: Better training algorithms. Large amounts of data / labels.
Keys to success: Better training algorithms. Large amounts of data / labels. Hardware / storage: GPUs, parallel systems. [Chart: GPU memory in GB for GTX-580, Titan Black ('14), Titan X ('15).]
Keys to success: Better training algorithms. Large amounts of data / labels. Hardware / storage. A larger community of researchers.
Keys to success: Together, these enabled larger networks. [Chart: number of parameters in millions for LeNet-5, AlexNet, VGGNet-16.]
Challenges: Embedded devices with limited resources / power. 2014: Jetson TK1. 2015/16: Jetson TX1.
Challenges: Embedded devices with limited resources / power. Memory is a limiting factor. Real-time operation is required.
Computational Cost: AlexNet. The forward pass is time-consuming.
Computational Cost: AlexNet. Memory bottleneck.
Computational Cost: VGGNet. Memory bottleneck. Parameters per block:
conv3-64  x 2 :      38,720
conv3-128 x 2 :     221,440
conv3-256 x 3 :   1,475,328
conv3-512 x 3 :   5,899,776
conv3-512 x 3 :   7,079,424
fc1           : 102,764,544
fc2           :  16,781,312
fc3           :   4,097,000
TOTAL         : 138,357,544
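The breakdown above follows directly from the layer shapes. Below is a minimal Python sketch that reproduces it, assuming the standard VGG-16 configuration (3x3 kernels with biases, a 7x7x512 feature map entering fc1, 1000 output classes): a conv layer has k*k*C_in*C_out + C_out parameters and a fully connected layer has N_in*N_out + N_out.

```python
# Sketch: reproduce the VGG-16 per-block parameter counts shown above.
def conv_params(k, c_in, c_out):
    return k * k * c_in * c_out + c_out      # weights + biases

def fc_params(n_in, n_out):
    return n_in * n_out + n_out

blocks = [
    ("conv3-64  x 2", [conv_params(3, 3, 64), conv_params(3, 64, 64)]),
    ("conv3-128 x 2", [conv_params(3, 64, 128), conv_params(3, 128, 128)]),
    ("conv3-256 x 3", [conv_params(3, 128, 256)] + [conv_params(3, 256, 256)] * 2),
    ("conv3-512 x 3", [conv_params(3, 256, 512)] + [conv_params(3, 512, 512)] * 2),
    ("conv3-512 x 3", [conv_params(3, 512, 512)] * 3),
    ("fc1", [fc_params(7 * 7 * 512, 4096)]),  # flattened 7x7x512 feature map
    ("fc2", [fc_params(4096, 4096)]),
    ("fc3", [fc_params(4096, 1000)]),         # 1000 ImageNet classes
]

total = 0
for name, params in blocks:
    total += sum(params)
    print(f"{name:15s}: {sum(params):>11,}")
print(f"{'TOTAL':15s}: {total:>11,}")
```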
Do we need all these parameters?
Over-Parameterization: Needed for highly non-convex optimization [1]. [1] Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, Yann LeCun. The Loss Surfaces of Multilayer Networks.
Over-Parameterization: Needed for highly non-convex optimization. Deeper structures have larger learning capacity [1]. [1] Guido Montúfar, Razvan Pascanu, Kyunghyun Cho, Yoshua Bengio. On the Number of Linear Regions of Deep Neural Networks. NIPS 2014.
Over-Parameterization: Needed for highly non-convex optimization. Deeper structures, larger learning capacity. From images to video -> even larger nets? A. Karpathy et al. Large-scale Video Classification with Convolutional Neural Networks. CVPR 2014.
Compacting CNNs
Compacting CNNs: Network distillation. Network pruning. Structured parameters. Ours.
Compacting CNNs: Network distillation.
Compacting CNNs: Network distillation. A large network learns from the data. Generate labels using the trained network. Train smaller nets on its output logits or softened softmax outputs (soft targets). Geoffrey Hinton, Oriol Vinyals, Jeff Dean. Distilling the Knowledge in a Neural Network. NIPS Workshop 2015.
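As an illustration of the soft-target idea, here is a minimal PyTorch sketch of a distillation loss in the spirit of Hinton et al.; the temperature T and mixing weight alpha are illustrative values, not numbers from the talk.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Soft-target distillation: KL divergence between temperature-softened
    teacher and student distributions, mixed with the usual cross-entropy on
    hard labels. T and alpha are illustrative hyper-parameters."""
    soft_teacher = F.softmax(teacher_logits / T, dim=1)
    log_soft_student = F.log_softmax(student_logits / T, dim=1)
    # The T^2 factor keeps soft- and hard-target gradients on a comparable scale.
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce
```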
Compacting CNNs: Network distillation (II). Use intermediate layers to guide the training; see the sketch below. Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, Yoshua Bengio. FitNets: Hints for Thin Deep Nets. ICLR 2015.
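A hedged sketch of the hint idea, assuming PyTorch: the student's intermediate feature map is mapped through a small learned regressor and pulled towards the teacher's intermediate feature map with an L2 loss. The channel sizes are placeholders, not values from FitNets or the talk.

```python
import torch.nn as nn

class HintLoss(nn.Module):
    """FitNets-style hint: L2 penalty between a teacher's intermediate feature
    map and the student's, after a learned 1x1-conv regressor matches channels."""
    def __init__(self, student_channels=64, teacher_channels=256):
        super().__init__()
        self.regressor = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, student_feat, teacher_feat):
        return nn.functional.mse_loss(self.regressor(student_feat), teacher_feat)
```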
Compacting CNNs: Network distillation. Pros: in general, better generalization and a faster, smaller net; equal or slightly better performance. Cons: requires a larger network to learn from.
Compacting CNNs: Network pruning. Directly remove unimportant parameters during training (some criteria require second derivatives). Removing parameters + quantization gives good compression rates, orthogonal to other approaches [1]. [1] S. Han, H. Mao, W. J. Dally. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. arXiv:1510.00149, 2015.
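For intuition, a simplified magnitude-pruning step in the spirit of Han et al. (PyTorch): the real Deep Compression pipeline adds iterative retraining, weight quantization, and Huffman coding, and the 90% sparsity level here is only illustrative.

```python
import torch

def magnitude_prune(weight, sparsity=0.9):
    """Zero out the smallest-magnitude weights and return a mask so they can
    be kept at zero while retraining. `sparsity` is the fraction removed."""
    k = max(1, int(sparsity * weight.numel()))
    threshold = weight.abs().flatten().kthvalue(k).values
    mask = (weight.abs() > threshold).float()
    return weight * mask, mask

# Usage: after each optimizer step, re-apply the mask so pruned weights stay zero:
#   w.data, mask = magnitude_prune(w.data); ...; w.data.mul_(mask)
```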
Compacting CNNs: Network distillation. Network pruning. Structured parameters.
Compacting CNNs: Structured parameters. Low-rank approximations. Max Jaderberg, Andrea Vedaldi, Andrew Zisserman. Speeding up Convolutional Neural Networks with Low Rank Expansions. BMVC 2014.
Compacting CNNs: Structured parameters. Low-rank approximations (II). Emily Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, Rob Fergus. Exploiting Linear Structure Within Convolutional Networks for Efficient Evaluation. NIPS 2014.
Compacting CNNs: Structured parameters. Low-rank approximations (III). Weights are approximated by a sum of rank-1 tensors. Emily Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, Rob Fergus. Exploiting Linear Structure Within Convolutional Networks for Efficient Evaluation. NIPS 2014.
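A toy numpy sketch of the rank-truncation idea: reshape the 4D filter bank into a matrix and keep the leading SVD terms, i.e. a sum of rank-1 components. Denton et al. use richer tensor decompositions; the filter-bank shape and rank below are illustrative.

```python
import numpy as np

def low_rank_filters(W, rank):
    """Approximate a conv weight tensor (C_out, C_in, k, k) by reshaping it
    into a matrix and truncating its SVD to `rank` rank-1 terms."""
    c_out, c_in, kh, kw = W.shape
    M = W.reshape(c_out, c_in * kh * kw)
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    M_approx = (U[:, :rank] * s[:rank]) @ Vt[:rank]   # sum of `rank` rank-1 matrices
    return M_approx.reshape(c_out, c_in, kh, kw)

W = np.random.randn(64, 3, 7, 7)                      # illustrative first-layer filter bank
W_r = low_rank_filters(W, rank=8)
print(np.linalg.norm(W - W_r) / np.linalg.norm(W))    # relative approximation error
```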
Compacting CNNs: Structured parameters. Weak points: needs a fully trained full-rank network; not all filters can be approximated; theoretical speed-ups come with a drop in performance. Emily Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, Rob Fergus. Exploiting Linear Structure Within Convolutional Networks for Efficient Evaluation. NIPS 2014.
Compacting CNNs: Structured parameters. Weak points: needs a fully trained full-rank network; not all filters can be approximated; drop in performance. Strengths: potential to aid regularization during or after training; parameter sharing within the layer.
Compacting CNNs: Structured parameters. Low-rank approximations (IV). VGG nets restrict filters during training: stacking 3x3 convolutions keeps the same receptive field as a larger filter, gives deeper networks (more nonlinearities), and uses fewer parameters (3x(3x3)C^2 = 27C^2 vs. 49C^2 for a single 7x7 layer). K. Simonyan, A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR 2015.
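The parameter comparison in that bullet is a one-line calculation; the snippet below works it out for C = 512 channels (an illustrative choice).

```python
# One 7x7 conv layer vs. a stack of three 3x3 layers, both with C input and
# C output channels (biases ignored for clarity).
C = 512
single_7x7 = 7 * 7 * C * C            # 49 C^2  -> 12,845,056
stack_3x3 = 3 * (3 * 3 * C * C)       # 27 C^2  ->  7,077,888
print(single_7x7, stack_3x3, stack_3x3 / single_7x7)   # ~0.55x the parameters
```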
Compacting CNNs: Structured parameters. Low-rank approximations (Ours [1]). Filter restriction during training: larger receptive fields, deeper networks (more nonlinearities), parameter sharing, fewer parameters. [1] Joint work with Lars Petersson. Under review.
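Since the method itself is under review, the snippet below is only a generic illustration of restricting filters during training, not the actual approach: a separable layer that composes a 1xk and a kx1 convolution, giving a kxk receptive field with parameter sharing and roughly 2k*C^2 instead of k^2*C^2 weights (PyTorch, illustrative sizes).

```python
import torch.nn as nn

class SeparableConv(nn.Module):
    """Generic low-rank filter restriction (not the method under review):
    a 1xk followed by a kx1 convolution spans a kxk receptive field."""
    def __init__(self, channels, k=7):
        super().__init__()
        self.horizontal = nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2))
        self.vertical = nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.vertical(self.relu(self.horizontal(x))))
```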
Compacting CNNs: Structured parameters. Low-rank approximations (Ours): ImageNet results (AlexNet). Baseline: Alex Krizhevsky, Ilya Sutskever, Geoffrey Hinton. ImageNet Classification with Deep Convolutional Neural Networks. NIPS 2012.
Compacting CNNs: Structured parameters. Low-rank approximations (Ours): Stereo matching. [Results for Ours-3 (32K), Ours-1 (32K), Ours-1 (48K).] Baseline: Jure Zbontar, Yann LeCun. Computing the Stereo Matching Cost with a Convolutional Neural Network. CVPR 2015.
Memory?
Computational Cost: VGGNet. Memory bottleneck (recap of the parameter breakdown above: the three fully connected layers account for roughly 124M of the 138M parameters).
Computational Cost: AlexNet. Memory bottleneck.
Memory Bottleneck: Sparse constraints during training (Ours [2]). Directly reduce the number of neurons. Select the optimal number of neurons. Significant memory reductions with a minor drop in performance. [2] Joint work with Hao Zhou, Fatih Porikli. Under review.
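The exact formulation is under review, so the snippet below is only a generic group-sparsity sketch of the idea of removing whole neurons during training: an L2,1 (group-Lasso) penalty where each group collects all weights feeding one output neuron or filter (PyTorch; the regularization strength is illustrative).

```python
import torch

def group_sparsity_penalty(weight, reg=1e-4):
    """Group-Lasso penalty: one group per output neuron/filter, so entire
    neurons are driven to zero and can be removed after training.
    weight: (C_out, C_in, k, k) for conv or (N_out, N_in) for fully connected."""
    per_neuron_norms = weight.flatten(1).norm(p=2, dim=1)
    return reg * per_neuron_norms.sum()

# Usage: loss = task_loss + sum(group_sparsity_penalty(m.weight)
#                               for m in model.modules()
#                               if isinstance(m, (torch.nn.Conv2d, torch.nn.Linear)))
```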
Memory Bottleneck: Sparse constraints during training (Ours [2]). [2] Joint work with Hao Zhou, Fatih Porikli. Under review.
Do we need all these parameters?
Compacting ConvNets for End-to-End Learning. Jose M. Alvarez. Joint work with Lars Petersson, Hao Zhou, Fatih Porikli.