Build GPU Cluster Hardware for Efficiently Accelerating CNN Training. YIN Jianxiong, Nanyang Technological University, jxyin@ntu.edu.sg


Visual Object Search: Private Large-scale Visual Object Database; Domain-Specific Object Search Engine; Singapore Smart City Plan

Deep Learning Dedicated GPU Cluster
2013: Servers w/ 2-24x E5-2630, 5x K20m, 4x Titan Black, 4x GTX 780
2014: 4x 4-GPU Server, 8x E5-2630v2, 16x K40
2015: 8-Server Test Cluster: 4x E5-2650v2, 16x K40, 4x K80, 8x Titan X, 36-port IB FDR switch
2016: Expanded 8-Server Cluster: 12x E5-2650v2, 16x K40, 8x K80, 8x M60, 8x Titan X, 36-port FDR IB switch

Parallel CNN Training is Important: CNNs are going deeper and getting more complicated, while single-thread performance has stopped increasing.

Multi-node Parallel-Training-Supportive Frameworks
(Stack: Training Algorithm / DNN Training Framework / HPC Facility Architecture)

Framework    Multi-GPU in Single Server   Multi-GPU across Servers   Data Parallelism   *Model Parallelism
Caffe        Yes                          No                         Yes                No
CNTK         Yes                          Yes                        Yes                No
MXNet        Yes                          Yes                        Yes                Yes
TensorFlow   Yes                          Yes                        Yes                No

Workflow and Job Handlers
- Mini-Batch Preparation: prepares the mini-batch in host memory
- Mini-Batch Caching: caches mini-batch samples in host memory
- Mini-Batch Loading: handles loading of the prepared mini-batch onto the GPUs
- Forward: GPUs compute the forward pass
- Synchronization: loss sync over the training interconnects, via AllReduce or Collect + Broadcast (slower)
- Backward: GPUs compute the backward pass
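The loop above can be sketched in plain Python/NumPy. This is a toy synchronized data-parallel step, not the talk's code: the worker count, the linear least-squares "model", and the learning rate are all illustrative assumptions; only the forward → AllReduce → update shape follows the slide.

```python
import numpy as np

def allreduce(grads):
    """Average gradients across workers, then give every worker the result."""
    mean = sum(grads) / len(grads)
    return [mean.copy() for _ in grads]

def train_step(weights, batches, lr=0.1):
    """One synchronized data-parallel step on a toy linear model."""
    grads = []
    for x, y in batches:                               # each worker: forward + backward
        pred = x @ weights                             # forward pass
        grads.append(2 * x.T @ (pred - y) / len(x))    # backward pass (MSE gradient)
    grads = allreduce(grads)                           # synchronization step
    return weights - lr * grads[0]                     # identical update on every worker

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])
w = np.zeros(2)
for _ in range(200):
    batches = []
    for _worker in range(4):                           # 4 simulated workers
        x = rng.normal(size=(32, 2))
        batches.append((x, x @ w_true))
    w = train_step(w, batches)
print(np.round(w, 2))                                  # converges toward w_true
```

Because the AllReduce hands every worker the same averaged gradient, all replicas stay bit-identical, which is exactly what makes the synchronization step the bottleneck the following slides analyze.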

Typical HPC Architecture
[diagram: typical HPC hardware and cluster — each node's GPUs attached through host RAM]

Optimal HPC Architecture
[diagram: GPU-dense HPC cluster — GPUs #0-#7 per node attached via PCI-E switches]
- High GPU:CPU ratio (>2), high GPU density
- Direct P2P GPU links
- Prioritized GPU cooling
- Lower TCO: lower monetary cost, lower maintenance cost, better performance

Key Performance Impactors and Corresponding Hardware

Data Parallelism:
- GPU feed: examples fed in per GPU
- Sync timing: after every forward computation
- Sync frequency: proportional to # of iterations
- Synced message body: the trained model (gradients)
- Synced message size: mostly large payload, > 1MB

Model Parallelism:
- Sync timing: after each layer is computed
- Sync frequency: proportional to # of layers x # of iterations
- Synced message body: output activation volume of each layer
- Synced message size: very tiny payload, ~4KB

Relevant hardware: single-GPU computing capacity (optimized algorithms, e.g., cuDNN, fbfft); CNN model structure; GPU memory size; interconnect topology (intra/inter-node); link capacity; traffic optimization.
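The message-size contrast above is simple arithmetic. A minimal sketch, with hypothetical parameter and activation counts chosen only to illustrate the >1MB-per-sync vs. ~4KB-per-sync split:

```python
# Back-of-envelope sync-traffic totals for the two parallelism modes.
# All model dimensions below are illustrative assumptions.

def data_parallel_sync_bytes(num_params, iterations, dtype_bytes=4):
    # Data parallelism syncs the whole model (gradients) once per iteration.
    return num_params * dtype_bytes * iterations

def model_parallel_sync_bytes(act_elems_per_layer, layers, iterations, dtype_bytes=4):
    # Model parallelism syncs each layer's output activations:
    # many tiny messages, proportional to layers x iterations.
    return act_elems_per_layer * dtype_bytes * layers * iterations

# e.g. a 10M-parameter model, 1000-element activations, 20 layers, 100 iterations
dp = data_parallel_sync_bytes(10_000_000, 100)   # 40MB per sync, bandwidth-bound
mp = model_parallel_sync_bytes(1_000, 20, 100)   # 4KB per sync, latency-bound
print(dp, mp)
```

The same total traffic behaves very differently on the wire: data parallelism cares about link bandwidth, model parallelism about per-message latency, which is why the later slides treat both.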

GPU Card

Must-Have Features
Software features:
- Updated cuDNN v4 support
- GPU Direct P2P/RDMA
- NVIDIA Collective Communication Library (NCCL) support
Hardware features:
- Single-precision performance matters
- Larger read cache preferred
- Maxwell cards strongly preferred over Kepler or other previous generations

Multi-GPU training reduces the per-card VRAM footprint and enables large-model training. VRAM usage decomposes into a batch-size-dependent part and a model-structure-dependent part.
H/W setup: 2x Intel Xeon E5-2650v2, 2x 8GT/s QPI, 192GB RAM, 1TB SATA-III SSD, 8x NVIDIA GeForce Titan X, Mellanox ConnectX-3 Pro FDR IB adapter
S/W setup: Ubuntu 14.04.3, CUDA 7.5, Driver 352.39, 3.0-OFED.3.0.2.0.0.1, GDR, cuDNN v4 enabled
Training config: NN: Inception-BN, batch size=128, lr_factor=0.94, lr=0.045, synchronized training, num_classes=1000

GPU Memory and System Throughput: larger memory allows a larger batch size, which pushes the GPU to higher utilization.
[charts: MXNet on 8x K80 chips, model VggNet — batch size=512 vs. batch size=256; GPU idling time shrinks at the larger batch size]
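A hedged back-of-envelope helper for the point above: VRAM splits into a fixed model-structure-dependent part (weights, gradients, optimizer state) and a batch-size-dependent part (activations), so extra VRAM buys batch size. The parameter and activation counts and the `state_copies` factor are illustrative assumptions, not measurements from the talk:

```python
# Largest batch that fits, given a fixed model-state footprint plus a
# per-sample activation footprint. All sizes are illustrative.

def max_batch_size(vram_bytes, act_elems_per_sample, num_params,
                   state_copies=3, dtype_bytes=4):
    fixed = num_params * state_copies * dtype_bytes       # model-dependent part
    per_sample = act_elems_per_sample * dtype_bytes       # batch-dependent part
    return max(0, (vram_bytes - fixed) // per_sample)

# Doubling VRAM (e.g. a 12GB K40/Titan X vs. a 24GB M40/M6000) more than
# doubles the batch head-room once the fixed model state is accounted for.
params, acts = 140_000_000, 4_000_000    # hypothetical VGG-like numbers
b12 = max_batch_size(12 * 1024**3, acts, params)
b24 = max_batch_size(24 * 1024**3, acts, params)
print(b12, b24)
```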

Throughput and Scalability
[charts comparing K40 (12GB GDDR5, 384-bit), Titan X (12GB GDDR5, 384-bit), and Tesla M40 / Quadro M6000 (24GB VRAM)]

Form Factors and Availability: the right cooler for a multi-GPU setup. When identical in specs, passive-cooling cards (vs. active cooling) offer:
1) less inter-card heat interference, which increases stability
2) lower operating temperature
3) no clock-speed throttling
4) resistance to cooling-fan failure

GPU Interconnect

When and How to Sync: reducing sync time.
- Transfer overhead → reduce-handling processor: CPU handling, GPU handling, or hybrid handling (hardware + software)
- Sync payload size → traffic optimization for large (>1MB) and small (<4KB) messages (software)
- Sync schedule & rules → asynchronous vs. synchronous; frequent sync, long-interval sync, or hybrid sync (software)
- Data path capacities → mainboard IO layout, bandwidth, latency (hardware)
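One way to read the hybrid-handling idea above is a size-based routing rule using the slide's >1MB / <4KB thresholds. Which size class goes to which handler is a hypothetical illustration (the talk only names the classes), so the mapping below is an assumption:

```python
# Hypothetical payload-size router for the reduce step. The thresholds are
# the slide's; the handler assignments are assumed, not from the talk.

LARGE_BYTES = 1_000_000   # slide: "large" traffic, > 1MB
SMALL_BYTES = 4_096       # slide: "small" traffic, < 4KB

def pick_reduce_handler(payload_bytes):
    if payload_bytes > LARGE_BYTES:
        return "gpu"      # assumed: big gradient blobs stay on the fast D2D path
    if payload_bytes < SMALL_BYTES:
        return "cpu"      # assumed: tiny latency-bound messages go through the host
    return "hybrid"       # mid-sized payloads: split or tuned case by case

print(pick_reduce_handler(40_000_000), pick_reduce_handler(2_048))
```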

CPU vs. GPU Reducer: when D2D and H2D/D2H bandwidths are similar, compare the overhead of Reduce-Broadcast, AllReduce by CPU+GPUs, and AllReduce by GPUs. In practice, BW_H2D/D2H ≈ 1.5x BW_D2D.
[chart: reduce-handler benchmarking w/ K40 across PIX, PXB, and QPI topologies, for CPU, GPU, and hybrid handling]
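A minimal transfer-time model for this comparison, under stated assumptions: a host-side reduce-broadcast serializes N gathers (D2H) and N broadcasts (H2D) through the host link, while a GPU ring AllReduce moves roughly 2(N-1)/N of the message over D2D links. The slide's BW_H2D ≈ 1.5 × BW_D2D ratio is applied; the absolute bandwidth, message size, and GPU count are illustrative:

```python
# Sketch cost model, not a benchmark. Bandwidths in bytes/s, times in seconds.

def host_reduce_broadcast_time(msg_bytes, n_gpus, bw_h2d):
    # N device-to-host gathers + N host-to-device broadcasts, serialized
    # through the single host link.
    return 2 * n_gpus * msg_bytes / bw_h2d

def gpu_ring_allreduce_time(msg_bytes, n_gpus, bw_d2d):
    # Standard ring AllReduce: each GPU sends/receives ~2(N-1)/N of the message.
    return 2 * (n_gpus - 1) / n_gpus * msg_bytes / bw_d2d

BW_D2D = 10e9                 # assumed P2P D2D bandwidth
BW_H2D = 1.5 * BW_D2D         # slide: H2D/D2H ~= 1.5x D2D
msg, n = 40e6, 8              # 40MB gradient payload, 8 GPUs (assumed)
t_host = host_reduce_broadcast_time(msg, n, BW_H2D)
t_gpu = gpu_ring_allreduce_time(msg, n, BW_D2D)
print(t_host, t_gpu)
```

Even with the faster host link, the ring's per-GPU traffic shrinks as N grows while the host path's grows linearly, which is why GPU-side reduction wins for large payloads on dense nodes.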

Extend the System to Multiple Nodes. Key issues — inter-node links have: 1) much smaller link BW (~6GB/s), 2) longer latency (up to 10x more than a PCI-E link), 3) limited interface BW to the system, so they suffer from low intra-node BW to peer cards.
Mitigations:
- Pipeline compute & sync: sacrifices immediate consistency, hurts accuracy
- Traffic optimization: local aggregation, message data compression
- H/W architecture innovation: faster inter-node links, bottleneck-removing P2P alternatives
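The slide's two inter-node penalties can be put into one latency+bandwidth sync-time model. The ~6GB/s and "up to 10x latency" figures are the slide's; the PCIe baseline numbers and gradient payload are assumptions for illustration:

```python
# Simple alpha-beta (latency + size/bandwidth) model of one sync transfer.

def sync_time_s(msg_bytes, bw_bytes_per_s, latency_s):
    return latency_s + msg_bytes / bw_bytes_per_s

PCIE_BW, PCIE_LAT = 12e9, 5e-6    # assumed effective PCIe 3.0 x16 numbers
IB_BW, IB_LAT = 6e9, 50e-6        # slide: ~6GB/s, up to 10x the PCIe latency

msg = 40e6                        # 40MB gradient payload (assumption)
t_intra = sync_time_s(msg, PCIE_BW, PCIE_LAT)
t_inter = sync_time_s(msg, IB_BW, IB_LAT)
print(t_intra, t_inter)
```

Under these assumptions the inter-node sync is roughly 2x slower for a large payload, which motivates the local-aggregation idea: reduce within the node first, so only one already-combined message crosses the slow link.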

Intra-Node P2P Topologies
[diagram: four intra-node P2P topologies (Topo-1 through Topo-4), differing in host RAM attachment and PCI-E switch placement]

Both Link Bandwidth and Latency Matter for Data Parallelism.
[chart: typical root-node traffic over IB (a Caffe variation)]
H/W setup: 2x Intel Xeon E5-2650v2, 2x 8GT/s QPI, 192GB RAM, 1TB SATA-III SSD, NVIDIA Tesla K40, Mellanox ConnectX-3 Pro FDR IB adapter
S/W setup: Ubuntu 14.04.3, CUDA 7.5, Driver 352.39, 3.0-OFED.3.0.2.0.0.1, GDR, cuDNN v4 enabled
Training config: NN: Inception-BN, batch size=128, lr_factor=0.94, lr=0.045, synchronized training, num_classes=1000

NUMA-Based or Chassis-Based
[diagrams: NUMA-based vs. chassis-based GPU attachment]
H/W setup: 2x Intel Xeon E5-2650v2, 2x 8GT/s QPI, 192GB RAM, 1TB SATA-III SSD, 8x NVIDIA Tesla K40, Mellanox ConnectX-3 Pro FDR IB adapter
S/W setup: Ubuntu 14.04.3, CUDA 7.5, Driver 352.39, 3.0-OFED.3.0.2.0.0.1, Direct P2P/RDMA, cuDNN v4 enabled
Training config: NN: Inception-BN, batch size=128, lr_factor=0.94, lr=0.045, synchronized training, num_classes=1000

Other Major Components

Pipelined IO Hides Handling Overhead. Training timeline:
- Iteration #1: Initialize Training → Prepare Batch
- Iteration #2: Forward → Sync & Combine Loss (waiting...) → Backward, while the next batch is prepared in parallel
- Iterations #3, #4: Prepare Batch overlaps the previous iteration's compute
Phases: Phase#1 Prepare Batch; Phase#2 Forward; Phase#3 Sync & Combine Loss (Phase#3.1 Transfer Loss, Phase#3.2 Combine Loss, Phase#3.3 Distribute Combined Loss); Phase#4 Backward.
*Phase#3 handling is implementation dependent; e.g., MXNet supports both CPU and GPU handling.
**The phase lengths above are not proportional to the actual time spent on each sub-task; the actual time cost depends on the structure of the model.
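The timeline above reduces to a small formula: without a pipeline each iteration pays prep + compute, while with a pipeline batch i+1 is prepared during iteration i's compute, so steady-state cost is max(prep, compute). A toy sketch with arbitrary illustrative time units:

```python
# Toy model of the training timeline: sequential vs. pipelined batch prep.

def training_time(iters, prep, compute, pipelined):
    if not pipelined:
        return iters * (prep + compute)       # every iteration pays both phases
    # The first batch must still be prepared up front; afterwards prep for
    # iteration i+1 overlaps compute for iteration i.
    return prep + iters * max(prep, compute)

seq = training_time(100, 2, 5, pipelined=False)
pipe = training_time(100, 2, 5, pipelined=True)
print(seq, pipe)   # pipelining hides the IO term entirely when prep <= compute
```

When prep ≤ compute the IO cost vanishes from the steady state, which is exactly why the later CPU and RAM slides argue that modest IO-side hardware is sufficient.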

CPU Selection (specs, empirical minimum suggestions, and elaborations)

- Series: Xeon or Core does not matter; Xeon processors are preferred because of better motherboard support.
- Microarchitecture: Ivy Bridge-EP or a higher architecture; Sandy Bridge not recommended [1]. A Sandy Bridge processor's throughput is capped to 0.8GB/s and 5.56GB/s for small and large messages respectively [2].
- PCIe lanes: a CPU with 40 PCI-E lanes. Even among EP-architecture processors there are still low-PCIe-lane models [3].
- Cores: 2x threads for each GPU; the CPU mainly handles IO requests and decoding of batch samples.
- Clock: base clock 2.4GHz, up to 3.2GHz. IO-handling overhead can be hidden by pipelining, so clock speed has limited effect.

[1] Efficient Data Movement on GPU-Accelerated Clusters (GTC 2014)
[2] MVAPICH2: A High Performance MPI Library for NVIDIA GPU Clusters with InfiniBand (GTC 2013)
[3] Tim Dettmers: http://timdettmers.com/2015/03/09/deep-learning-hardware-guide/

Host RAM & Storage Selection
Host RAM:
- Total amount: 2x total VRAM size
- Specs: DDR3-1066 vs. DDR3-1866 hardly makes a difference, because H2D memory IO is hidden by the pipeline
Storage (implementation dependent):
- SSD is good for checkpoint-saving (CPS) speed, which is fairly important for fault tolerance
- NFS storage hurts CPS speed
- Local SATA drive preferred: saves PCIe lane resources
Save the extra RAM budget and invest it in SSDs, GPU cards, and better layout & cooling-efficient servers.
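The 2x-total-VRAM host-RAM rule of thumb is easy to sanity-check against the testbed described earlier; the helper below is just that arithmetic (the GPU count and per-card VRAM are example inputs):

```python
# Host RAM sizing per the slide's rule of thumb: 2x total VRAM.

def min_host_ram_gb(num_gpus, vram_per_gpu_gb):
    return 2 * num_gpus * vram_per_gpu_gb

# 8x 12GB cards (e.g. the 8x Titan X testbed) -> 192GB, matching the
# 192GB host RAM in the benchmark configurations above.
ram = min_host_ram_gb(8, 12)
print(ram)
```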

Multi-GPU-Friendly Server Airflow Design: the GPUs are the major power (and heat) load in the chassis.

Wrap-up

Design Goals and System Architecture Mapping
Major metrics → requirement inputs:
- Throughput ← GPU computing capacity
- Scalability ← node IO layout; inter-node link capacity
- Monetary unit ← per-throughput $; scale-up $
Relevant key specs: GPU processor density; GPU architecture; clock speed; RAM size; memory BW; PCIe version; PCIe 3.0 x16 slot qty; PCIe 3.0 x8 slot qty; PCIe switch layout; link latency; PCIe 3.0 x16 slot qty versus CPU socket qty ratio

Wish List
- Architecture scalability: avoid short-board (bottleneck) links; partitioned workload; balanced traffic; large-BW, low-latency P2P intra/inter-node links → NVLINK + EDR/XDR
- Maximize GPU performance: maximize GPU processor utilization; large, high-BW dedicated local RAM → Pascal + HBM
- Decoupled GPU-to-storage pipeline throughput: hide IO and P2P overhead and delay from compute → NVLINK + GPUDirect, PCIe/CAPI

Server Node Recommendations
[diagrams: two recommended node designs — "Balanced" (GPUs #0-#3 per socket behind PIX switches, one EDR card per socket) and "Max Density, EDR Ready" (8 GPUs behind a PCI-E switch complex with an 80GB/s G2G link, EDR cards, and direct P2P links)]

ACKNOWLEDGEMENT. Credits: Pradeep Gupta (NVIDIA), Project Manager; Wang Xingxing (ROSE Lab, NTU, Singapore); Xiao Bin (ex-WeChat, Tencent). Funding agencies, industry partners, and open-source projects.

Q & A