Deep Learning and GPUs
Julie Bernauer
GPU Computing
GPU Computing: the GPU works as a massively parallel accelerator alongside the x86 CPU.
CUDA: Framework to Program NVIDIA GPUs

A simple sum of two vectors (arrays) in C:

    void vector_add(int n, const float *a, const float *b, float *c) {
        for (int idx = 0; idx < n; ++idx)
            c[idx] = a[idx] + b[idx];
    }

GPU-friendly version in CUDA:

    __global__ void vector_add(int n, const float *a, const float *b, float *c) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n)
            c[idx] = a[idx] + b[idx];
    }
GPU-Accelerated Libraries: Drop-in Acceleration for Your Applications

- Linear Algebra (FFT, BLAS, SPARSE, matrix solvers): cuFFT, cuBLAS, cuSPARSE, cuSOLVER
- Numerical & Math (RAND, statistics): NVIDIA Math Library, cuRAND
- Data Structures & AI (sort, scan, zero sum): GPU AI for board games, GPU AI for path finding
- Visual Processing (image & video): NVIDIA NPP, NVIDIA Video Encode
Deep Neural Networks
Image Classification with DNNs

Training: learn from labeled images (cars, buses, trucks, motorcycles).
Inference: classify a new image (e.g. "truck").
Image Classification with DNNs: Training

A typical training run:
- Pick a DNN design
- Input 100 million training images spanning 1,000 categories
- One week of computation
- Test accuracy
- If bad: modify the DNN, fix the training set, or update the training parameters
Deep Neural Networks and GPUs
Accelerating Insights: Google Datacenter vs. Stanford AI Lab

"Now You Can Build Google's $1M Artificial Brain on the Cheap"

Google Datacenter: 1,000 CPU servers, 2,000 CPUs (16,000 cores), 600 kW, $5,000,000
Stanford AI Lab: 3 GPU-accelerated servers, 12 GPUs (18,432 cores), 4 kW, $33,000

"Deep learning with COTS HPC systems", A. Coates, B. Huval, T. Wang, D. Wu, A. Ng, B. Catanzaro, ICML 2013
Modern AI

ImageNet classification accuracy rate by year, 2009-2016 (chart): traditional computer vision plateaued, while deep learning approaches, introduced around 2012, pushed accuracy sharply upward. Milestones: 2012, Google Brain; 2016, AlphaGo.
Deep Learning Everywhere

- Internet & Cloud: image classification, speech recognition, language translation, language processing, sentiment analysis, recommendation
- Medicine & Biology: cancer cell detection, diabetic grading, drug discovery
- Media & Entertainment: video captioning, video search, real-time translation
- Security & Defense: face detection, video surveillance, satellite imagery
- Autonomous Machines: pedestrian detection, lane tracking, traffic sign recognition
NVIDIA GPU: The Engine of Deep Learning

Frameworks built on the NVIDIA CUDA accelerated computing platform: Watson, Chainer, Theano, MatConvNet, TensorFlow, CNTK, Torch, Caffe.
Accelerating Deep Learning: cuDNN

- GPU-accelerated deep learning subroutines
- High-performance neural network training
- Accelerates major deep learning frameworks: Caffe, Theano, Torch
- Up to 3.5x faster AlexNet training in Caffe than baseline GPU

Caffe performance chart (speedup, 11/2013 to 12/2015): K40, K40+cuDNN1, M40+cuDNN3, M40+cuDNN4.

AlexNet training throughput based on 20 iterations. CPU: 1x E5-2680v3, 12 cores, 2.5 GHz, 128 GB system memory, Ubuntu 14.04.

developer.nvidia.com/cudnn
Multi-GPU Communication: NCCL

A research library of accelerated collectives, easily integrated and topology-aware, to improve the scalability of multi-GPU applications.

- Patterned after MPI's collectives
- Handles intra-node communication in an optimal way
- Provides the functionality for MPI to build on top of, to handle inter-node communication

github.com/NVIDIA/nccl
NCCL Example: All-Reduce

    #include <nccl.h>

    ncclComm_t comm[4];
    int devs[4] = {0, 1, 2, 3};
    ncclCommInitAll(comm, 4, devs);

    foreach g in (GPUs) {   // or foreach thread
        cudaSetDevice(g);
        double *d_send, *d_recv;
        // allocate d_send, d_recv; fill d_send with data
        ncclAllReduce(d_send, d_recv, N, ncclDouble, ncclSum,
                      comm[g], stream[g]);
        // consume d_recv
    }
Platform
An End-to-End Solution

The data scientist configures the solver and network, monitors training through a dashboard, and trains the model; the trained model is then deployed, e.g. to an embedded platform, for classification, detection, and segmentation.
CUDA for Deep Learning Development

Deep Learning SDK: DIGITS, cuDNN, cuSPARSE, cuBLAS, NCCL
Hardware: TITAN X, DevBox, GPU cloud
Tesla for Hyperscale Datacenters

Hyperscale workloads: 10M users, 40 years of video uploaded per day, 270M items sold per day, 43% of traffic on mobile devices.

Hyperscale Suite: Deep Learning SDK, GPU REST Engine, GPU-accelerated FFmpeg, Image Compute Engine, GPU support in Mesos

- Tesla M40 (powerful): fastest deep learning performance
- Tesla M4 (low power): highest hyperscale throughput
Jetson for Intelligent Machines

Jetson TX1 (with Jetson SDK):
- GPU: 1 TFLOP/s, 256-core Maxwell
- CPU: 64-bit ARM A57 CPUs
- Memory: 4 GB LPDDR4, 25.6 GB/s
- Storage: 16 GB eMMC
- Size: 50 mm x 87 mm
- Power: under 10 W
Using the GPU for Deep Learning
DIGITS™: Interactive Deep Learning GPU Training System

- Quickly design the best deep neural network (DNN) for your data
- Train on multiple GPUs (automatic)
- Visually monitor DNN training quality in real time
- Manage training of many DNNs in parallel on multi-GPU systems

developer.nvidia.com/digits
For the Developer: DIGITS™ DevBox

Fastest deskside deep learning training box:
- Four TITAN X GPUs with 12 GB of memory per GPU
- 1600 W power supply unit
- Ubuntu 14.04, NVIDIA-qualified driver, NVIDIA CUDA Toolkit 7.0
- NVIDIA DIGITS software; Caffe, Theano, Torch, BIDMach
For the Datacenter: Multi-GPU Servers

For accelerated training, e.g. Facebook's Big Sur Open Compute server.
Tesla M40: World's Fastest Accelerator for Deep Learning

8x faster Caffe performance than CPU; reduces training time from 8 days to 1 day (chart: days of training, CPU vs. Tesla M40).

- CUDA cores: 3072
- Peak SP: 7 TFLOPS
- GDDR5 memory: 12 GB
- Memory bandwidth: 288 GB/s
- Power: 250 W

Caffe benchmark: AlexNet training throughput based on 20 iterations. CPU: E5-2697v2 @ 2.70 GHz, 64 GB system memory, CentOS 6.2.
What else can one do with Deep Learning?
Action Recognition
Segmentation

Clement Farabet, Camille Couprie, Laurent Najman and Yann LeCun: "Learning Hierarchical Features for Scene Labeling", IEEE Transactions on Pattern Analysis and Machine Intelligence, August 2013. https://www.youtube.com/watch?v=kknhdlns13u
Image Captioning

"Automated Image Captioning with ConvNets and Recurrent Nets", Andrej Karpathy, Fei-Fei Li
Playing Games

Google's DeepMind. https://www.youtube.com/watch?v=v1eynij0rnk
Resources
Want to Try? Links and Resources

Deep learning: https://developer.nvidia.com/deep-learning
Hands-on labs: https://nvidia.qwiklab.com/
Questions? Email jbernauer@nvidia.com