Deep Learning and GPUs
Julie Bernauer
GPU Computing
GPU Computing: the GPU works as a massively parallel accelerator alongside the x86 CPU.
CUDA: Framework to Program NVIDIA GPUs

A simple sum of two vectors (arrays) in C:

    void vector_add(int n, const float *a, const float *b, float *c) {
        for (int idx = 0; idx < n; ++idx)
            c[idx] = a[idx] + b[idx];
    }

GPU-friendly version in CUDA:

    __global__ void vector_add(int n, const float *a, const float *b, float *c) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n)
            c[idx] = a[idx] + b[idx];
    }
GPU-Accelerated Libraries: Drop-in Acceleration for Your Applications

- Linear Algebra (FFT, BLAS, SPARSE, matrix solvers): cuFFT, cuBLAS, cuSPARSE, cuSOLVER
- Numerical & Math (RAND, statistics): NVIDIA Math Library, cuRAND
- Data Structures & AI (sort, scan, zero sum): GPU AI for board games, GPU AI for path finding
- Visual Processing (image & video): NVIDIA NPP, NVIDIA Video Encode
Deep Neural Networks
Image Classification with DNNs

Training: learn from labeled images (cars, buses, trucks, motorcycles).
Inference: classify a new image (e.g. "truck").
Image Classification with DNNs: Training

A typical training run:
- Pick a DNN design
- Input 100 million training images spanning 1,000 categories
- One week of computation
- Test accuracy
- If bad: modify the DNN, fix the training set, or update the training parameters
Deep Neural Networks and GPUs
Accelerating Insights: Google Datacenter vs. Stanford AI Lab

"Now You Can Build Google's $1M Artificial Brain on the Cheap"

Google Datacenter: 1,000 CPU servers, 2,000 CPUs (16,000 cores), 600 kW, $5,000,000
Stanford AI Lab: 3 GPU-accelerated servers, 12 GPUs (18,432 cores), 4 kW, $33,000

"Deep learning with COTS HPC systems", A. Coates, B. Huval, T. Wang, D. Wu, A. Ng, B. Catanzaro, ICML 2013
Modern AI

ImageNet classification accuracy rate by year, 2009-2016 (chart): traditional computer vision plateaued, while deep learning approaches, introduced around 2012, pushed accuracy sharply upward. Milestones: 2012, Google Brain; 2016, AlphaGo.
Deep Learning Everywhere

- Internet & Cloud: image classification, speech recognition, language translation, language processing, sentiment analysis, recommendation
- Medicine & Biology: cancer cell detection, diabetic grading, drug discovery
- Media & Entertainment: video captioning, video search, real-time translation
- Security & Defense: face detection, video surveillance, satellite imagery
- Autonomous Machines: pedestrian detection, lane tracking, traffic sign recognition
NVIDIA GPU: The Engine of Deep Learning

Frameworks built on the NVIDIA CUDA accelerated computing platform: Watson, Chainer, Theano, MatConvNet, TensorFlow, CNTK, Torch, Caffe.
Accelerating Deep Learning: cuDNN

- GPU-accelerated deep learning subroutines
- High-performance neural network training
- Accelerates major deep learning frameworks: Caffe, Theano, Torch
- Up to 3.5x faster AlexNet training in Caffe than baseline GPU

Caffe performance chart (speedup, 11/2013 to 12/2015): K40, K40+cuDNN1, M40+cuDNN3, M40+cuDNN4.

AlexNet training throughput based on 20 iterations. CPU: 1x E5-2680v3, 12 cores, 2.5 GHz, 128 GB system memory, Ubuntu 14.04.

developer.nvidia.com/cudnn
Multi-GPU Communication: NCCL

A research library of accelerated collectives, easily integrated and topology-aware, to improve the scalability of multi-GPU applications.

- Patterned after MPI's collectives
- Handles intra-node communication in an optimal way
- Provides the functionality for MPI to build on top of, to handle inter-node communication

github.com/NVIDIA/nccl
NCCL Example: All-Reduce

    #include <nccl.h>

    ncclComm_t comm[4];
    int devs[4] = {0, 1, 2, 3};
    ncclCommInitAll(comm, 4, devs);

    foreach g in (GPUs) {   // or foreach thread
        cudaSetDevice(g);
        double *d_send, *d_recv;
        // allocate d_send, d_recv; fill d_send with data
        ncclAllReduce(d_send, d_recv, N, ncclDouble, ncclSum,
                      comm[g], stream[g]);
        // consume d_recv
    }
Platform
An End-to-End Solution

The data scientist configures the solver and network, monitors training through a dashboard, and trains the model; the trained model is then deployed, e.g. to an embedded platform, for classification, detection, and segmentation.
CUDA for Deep Learning Development

Deep Learning SDK: DIGITS, cuDNN, cuSPARSE, cuBLAS, NCCL
Hardware: TITAN X, DevBox, GPU cloud
Tesla for Hyperscale Datacenters

Hyperscale workloads: 10M users, 40 years of video uploaded per day, 270M items sold per day, 43% of traffic on mobile devices.

Hyperscale Suite: Deep Learning SDK, GPU REST Engine, GPU-accelerated FFmpeg, Image Compute Engine, GPU support in Mesos

- Tesla M40 (powerful): fastest deep learning performance
- Tesla M4 (low power): highest hyperscale throughput
Jetson for Intelligent Machines

Jetson TX1 (with Jetson SDK):
- GPU: 1 TFLOP/s, 256-core Maxwell
- CPU: 64-bit ARM A57 CPUs
- Memory: 4 GB LPDDR4, 25.6 GB/s
- Storage: 16 GB eMMC
- Size: 50 mm x 87 mm
- Power: under 10 W
Using the GPU for Deep Learning
DIGITS™: Interactive Deep Learning GPU Training System

- Quickly design the best deep neural network (DNN) for your data
- Train on multiple GPUs (automatic)
- Visually monitor DNN training quality in real time
- Manage training of many DNNs in parallel on multi-GPU systems

developer.nvidia.com/digits
For the Developer: DIGITS™ DevBox

Fastest deskside deep learning training box:
- Four TITAN X GPUs with 12 GB of memory per GPU
- 1600 W power supply unit
- Ubuntu 14.04, NVIDIA-qualified driver, NVIDIA CUDA Toolkit 7.0
- NVIDIA DIGITS software; Caffe, Theano, Torch, BIDMach
For the Datacenter: Multi-GPU Servers

For accelerated training, e.g. Facebook's Big Sur Open Compute server.
Tesla M40: World's Fastest Accelerator for Deep Learning

8x faster Caffe performance than CPU; reduces training time from 8 days to 1 day (chart: days of training, CPU vs. Tesla M40).

- CUDA cores: 3072
- Peak SP: 7 TFLOPS
- GDDR5 memory: 12 GB
- Memory bandwidth: 288 GB/s
- Power: 250 W

Caffe benchmark: AlexNet training throughput based on 20 iterations. CPU: E5-2697v2 @ 2.70 GHz, 64 GB system memory, CentOS 6.2.
What else can one do with Deep Learning?
Action Recognition
Segmentation

Clement Farabet, Camille Couprie, Laurent Najman and Yann LeCun: "Learning Hierarchical Features for Scene Labeling", IEEE Transactions on Pattern Analysis and Machine Intelligence, August 2013. https://www.youtube.com/watch?v=kknhdlns13u
Image Captioning

"Automated Image Captioning with ConvNets and Recurrent Nets", Andrej Karpathy, Fei-Fei Li
Playing Games

Google's DeepMind. https://www.youtube.com/watch?v=v1eynij0rnk
Resources
Want to Try? Links and Resources

Deep learning: https://developer.nvidia.com/deep-learning
Hands-on labs: https://nvidia.qwiklab.com/
Questions? Email jbernauer@nvidia.com