ECE 901 Paper Presentation. Training Deep Nets with Sublinear Memory Cost

ECE 901 Paper Presentation
Training Deep Nets with Sublinear Memory Cost
Tianqi Chen, Bing Xu, Chiyuan Zhang and Carlos Guestrin
Presented by: Akshay Sood, Sneha Rudra

Computation Graph
A computation graph is used in many frameworks (Theano, TensorFlow, MXNet). Nodes represent operations, and edges represent the dependencies between the operations.
[Figure: example graph containing Softmax and Sigmoid nodes. Source: http://cs231n.github.io/]
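A minimal sketch of how such a graph can be represented and traversed (hypothetical Node class and helper, not any framework's actual API):

```python
# Minimal computation-graph sketch (illustrative only; not a real framework API).

class Node:
    def __init__(self, name, op, inputs=()):
        self.name = name            # e.g. "fc1", "sigmoid1", "softmax"
        self.op = op                # callable that computes this node's output
        self.inputs = list(inputs)  # edges: the nodes this node depends on

def topological_order(outputs):
    """Return nodes in an order where every node appears after its inputs."""
    order, visited = [], set()
    def visit(node):
        if node in visited:
            return
        visited.add(node)
        for inp in node.inputs:
            visit(inp)
        order.append(node)
    for out in outputs:
        visit(out)
    return order
```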

Memory Optimization with Computation Graph
Deep net training often needs large memory to store the intermediate outputs and gradients; each intermediate result corresponds to a node in the computation graph. Two types of memory optimization can be used:
- Inplace operation: directly store the output values into the memory of an input value.
- Memory sharing: memory used by intermediate results that are no longer needed is recycled and used by another node.
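As a concrete NumPy-level illustration of the two optimizations (not tied to the paper's implementation):

```python
import numpy as np

x = np.random.randn(4, 1024)

# Inplace operation: compute sigmoid(x) by writing every step back into x's
# own buffer, so no new array is allocated for the intermediate result.
np.negative(x, out=x)
np.exp(x, out=x)
x += 1.0
np.reciprocal(x, out=x)            # x now holds sigmoid of the original values

# Memory sharing: once an intermediate buffer is no longer needed, a later
# node can write into it instead of allocating a fresh one.
recycled = np.empty_like(x)        # stands in for a buffer freed by an earlier node
np.multiply(x, 2.0, out=recycled)  # later operation reuses the recycled buffer
```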

Memory Optimization with Computation Graph
In the example graph, the first sigmoid transformation is carried out as an inplace operation, and its memory is then reused by its backward operation. The storage of the softmax gradient is shared with the gradient of the first fully connected layer.
[Figure: memory allocation for each node; the same color indicates shared memory.]

O(n) Memory Allocation Heuristic
Memory can be shared only between nodes whose lifetimes do not overlap. One option to identify such nodes is to run a graph-coloring algorithm on the conflict graph, but this would cost O(n²) computation time. Instead, a heuristic with O(n) time complexity is used (a sketch follows below):
- Traverse the graph in topological order, and use a counter to indicate the liveness of each record.
- Perform an inplace operation when no other pending operation depends on the node's input (i.e. when the input's counter = 1).
- Share memory when a recycled tag released by another node is available (i.e. after that node's counter reaches 0).
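A minimal sketch of the counting heuristic, using the Node representation above (reference counting plus a pool of recycled memory tags; an approximation of the idea, not the paper's actual implementation):

```python
# Static memory planning via reference counting (illustrative sketch).

def plan_memory(nodes):
    """nodes: all graph nodes in topological order; each has .inputs."""
    # Count how many not-yet-executed operations still need each node's output.
    ref_count = {node: 0 for node in nodes}
    for node in nodes:
        for inp in node.inputs:
            ref_count[inp] += 1

    free_tags = []   # memory tags recycled from nodes whose counter hit 0
    tag_of = {}      # node -> memory tag assigned by the static plan
    next_tag = 0

    for node in nodes:
        # Inplace: if this is the only remaining consumer of some input
        # (counter = 1), overwrite that input's memory (assuming the op allows it).
        candidate = next((inp for inp in node.inputs if ref_count[inp] == 1), None)
        if candidate is not None:
            tag_of[node] = tag_of[candidate]
        elif free_tags:
            tag_of[node] = free_tags.pop()   # memory sharing: reuse a recycled tag
        else:
            tag_of[node] = next_tag          # otherwise allocate a fresh tag
            next_tag += 1

        # Release inputs that are no longer needed and recycle their tags.
        for inp in node.inputs:
            ref_count[inp] -= 1
            if ref_count[inp] == 0 and tag_of[inp] != tag_of[node]:
                free_tags.append(tag_of[inp])

    return tag_of
```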

O(n) Memory Allocation Heuristic
The heuristic is used as a static algorithm: memory is allocated to each node before execution, avoiding the overhead of garbage collection during runtime.

Memory-computation tradeoff
Memory optimization over the computation graph can reduce the memory footprint for both training and prediction. However, most gradient operators depend on the intermediate results of the forward pass, so we still need O(n) memory for intermediate results to train the network. Idea: drop some of the intermediate results during the forward pass, and recover them with extra forward computation when needed.

Segmentation
Idea: divide the neural network into segments. Only the output of each segment is stored during the forward pass; intermediate results within a segment are thrown away. During backprop, the within-segment computations are redone (a sketch follows below).
[Image source: http://neuron.tuke.sk/jaksa/slides/jaksa-citseminar2014/nets/4-3-3-3-3-3-3-3-3-3-3.png]
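A rough sketch of the segment-and-recompute idea over a plain list of layers (hypothetical layer objects with forward/backward methods; not the paper's MXNet implementation):

```python
# Segment-wise checkpointing sketch: store only segment boundaries on the
# forward pass, then recompute within-segment activations during backprop.

def forward_with_checkpoints(layers, x, segment_size):
    """Run the forward pass, keeping only each segment's input activation."""
    checkpoints = []
    for start in range(0, len(layers), segment_size):
        checkpoints.append(x)                      # segment boundary is kept
        for layer in layers[start:start + segment_size]:
            x = layer.forward(x)                   # within-segment results dropped
    return x, checkpoints

def backward_with_recompute(layers, checkpoints, grad_out, segment_size):
    """Backprop segment by segment, redoing each segment's forward pass."""
    starts = list(range(0, len(layers), segment_size))
    for seg_idx in reversed(range(len(starts))):
        start = starts[seg_idx]
        segment = layers[start:start + segment_size]
        # Recompute the intermediate activations inside this segment.
        acts, x = [], checkpoints[seg_idx]
        for layer in segment:
            acts.append(x)
            x = layer.forward(x)
        # Standard backprop through the segment using the recomputed activations.
        for layer, act in zip(reversed(segment), reversed(acts)):
            grad_out = layer.backward(act, grad_out)
    return grad_out
```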

Repeating intermediate computation
The user specifies a function m (the "mirror count") on the nodes of the computation graph, giving the number of times each node may be recomputed:
- Trivial case: m(v) = 0 for all nodes gives the original computation graph; no intermediate result is dropped.
- Common case: m(v) = 1 for nodes within each segment and m(v) = 0 for the output node of each segment; within-segment nodes are recomputed exactly once (see the small example below).
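In code, such a specification could be as simple as a per-node mapping (hypothetical node names; the slides do not show the paper's exact API):

```python
# Hypothetical mirror-count assignment for one Conv-BatchNorm-Activation segment.
segment_nodes = ["bn1_out", "relu1_out", "pool1_out"]   # cheap within-segment results
segment_output = "conv1_out"                            # expensive result kept at the boundary

mirror_count = {node: 1 for node in segment_nodes}      # may be recomputed once
mirror_count[segment_output] = 0                        # stored, never recomputed
```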

Example

Example
Easy application: drop the results of low-cost operations and keep the results that are expensive to compute. For instance, in a Conv-BatchNorm-Activation CNN pipeline, we can keep the result of the convolution but drop the results of batch normalization, the activation function, and pooling.
[Figure: GoogLeNet. Image source: https://indico.io/blog/wp-content/uploads/2015/08/googlenet.png]

Memory complexity
Assume we divide an n-node network into k segments. With a within-segment mirror factor of 1, the memory cost is

cost-total = O(n/k) + O(k)

(the cost of recomputing one segment at a time plus the cost of storing the k segment outputs). Choosing k = √n gives cost-total = O(√n), i.e. sublinear in the size of the network, at the price of an additional forward pass during training.
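For completeness, the minimization step behind the choice k = √n (reconstructed; the slide only states the result):

```latex
\text{cost-total}(k) = O\!\left(\frac{n}{k}\right) + O(k),
\qquad
\frac{d}{dk}\left(\frac{n}{k} + k\right) = -\frac{n}{k^{2}} + 1 = 0
\;\Longrightarrow\;
k = \sqrt{n},
\qquad
\text{cost-total} = O(\sqrt{n}).
```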

Recursive formulation

Recursive formulation
Let g(n) be the memory cost of a forward and backward pass on an n-layer network. Assume we store k intermediate results in the graph and apply the same strategy recursively inside each of the resulting sub-networks. Special case: k = 1 gives g(n) = O(log₂ n) memory, with a forward-pass cost of O(n log₂ n).
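A hedged sketch of the k = 1 case (the recursion is written out here for illustration; the slide only states the result):

```latex
g(n) = 1 + g\!\left(\frac{n}{2}\right), \quad g(1) = O(1)
\;\Longrightarrow\;
g(n) = O(\log_2 n)
```

Each of the O(log₂ n) recursion levels re-runs forward computation over at most n layers, which gives the stated O(n log₂ n) forward-pass cost.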

Experiments
Memory allocation strategies compared:
- No optimization: allocate separate memory to each node in the graph
- Inplace operations only
- All system optimizations: inplace + sharing
- Drop batch normalization and ReLU results + all system optimizations (only for the convolutional net)
- Sublinear plan + all system optimizations (trade computation for memory)

Experiment 1 - Deep Convolutional Network
Deep residual network (ResNet) architecture for image classification. System optimizations reduce the memory cost by a factor of two to three, but the memory cost is still linear in the number of layers; it is only possible to train a 200-layer ResNet on the best GPU. With the sublinear plan, a 1000-layer ResNet can be trained using less than 7 GB of GPU memory.

Experiment 2 - LSTM for Long Sequences
LSTM under a long-sequence unrolling setting, a typical setting for speech recognition. The sublinear plan gives more than a 4x memory reduction over the optimized memory plan.

Impact on Training Speed
Runtime cost is benchmarked on a single Titan X GPU. The sublinear allocation strategy costs about 30% additional runtime compared to the normal strategy, due to the doubled forward cost in the gradient calculation.

Conclusion
- A systematic approach to reduce the memory consumption of intermediate feature maps in deep net training.
- Computation-graph liveness analysis is used to enable memory sharing.
- It is possible to trade computation for memory.
- The combination of techniques makes it possible to train an n-layer deep neural network with only O(√n) memory cost, at the cost of one extra forward computation per mini-batch.