Deep Learning GPU-Based Hardware Platform




Deep Learning GPU-Based Hardware Platform: Hardware and Software Criteria and Selection

Mourad Bouache, Yahoo! Performance Engineering Group, Sunnyvale, CA, +1.408.784.1446, bouache@yahoo-inc.com
John Glover, Yahoo! Performance Engineering Group, Sunnyvale, CA, +1.408.349.7511, glover@yahoo-inc.com

TUTORIAL: ICS-2016, Istanbul, Turkey, May 31st 2016

ABSTRACT

Deep Learning, as part of a wider set of machine learning methods, consists of layers of integrated computations and comparisons that draw correlations from many different representations of data, and it is inherently compute intensive. In step with advances in CPU core counts and clock speeds, the time to process these data representations has decreased, and it is further reduced by parallel computing techniques. But as we scale out this parallel compute, we begin to hit the limits of what commodity and even standard enterprise servers can support, to the point that pushing for more cores requires custom solutions for chipset support, CPU interconnectivity, and power consumption. This can be extremely cost intensive and, depending on your data sets and associated revenues, may not deliver the best value of cost to compute.

Advances in compute technologies, in particular Graphics Processing Unit (GPU) technology, may explain the recent resurgence of interest in Deep Learning. There are many options in the GPU space, and the compute architecture of these cores is highly suited to floating-point number crunching as well as the matrix and vector mathematics involved in Deep Learning. In our work evaluating how different framework algorithms utilize compute hardware for Deep Learning, we tested several GPU-supporting servers from several vendors. This integrated system cluster required us to make careful hardware selections.
In this Workshop, we, as the Performance Engineering Group (PEG), will describe our experience with Deep Learning software and hardware objectives at Yahoo, as well as the selection and installation of hardware and software for research purposes for Yahoo engineering teams.

Presenters' Biographies

Mourad Bouache
Performance engineering and research; PhD in Computer Architecture. I have been doing performance engineering at Yahoo for almost four years. I develop benchmark code and profiling tools for code optimization and performance gains, and identify bottlenecks and optimal BIOS, software, and hardware configurations for different applications. My goal is to introduce new technologies based on identified needs: modern CPUs, GPUs, and Phi coprocessors for the intensive compute used in Yahoo's various Big Data applications. I work closely with software developers to improve code-base performance, reduce resource consumption, and shorten request latency; develop tools to monitor billions of user requests; and compare the compute resource offerings, in terms of workload, that Yahoo uses for cloud computing.

John Glover
Computer hardware research and performance engineer: server hardware/software configuration and installation, lab/network configuration, infrastructure and performance test project management, and performance analysis; C/C++ software development and shell scripting. ~10 years of experience evaluating current and emerging computer hardware technologies, including Intel, ARM, and AMD's latest processor architectures, DDR memory architectures, RAID controllers, and storage components such as NVMe and solid-state drives (SSDs), to support enterprise computing infrastructure. Evaluation includes performance testing and power consumption management for application and storage servers to ensure system and network compatibility. I also create and maintain hardware vendor relationships to procure pre-production evaluation units for lab testing and application testing in our data centers.

Workshop Description

In our work evaluating how different framework algorithms utilize compute hardware for Deep Learning, we tested several GPU-supporting servers from several vendors. With the need for x16 data lanes per GPU connection, the PCIe bus became the limiting factor for scaling out cores. We selected the Dell C4130 for its optional PCIe switch, which allows up to four x16-lane slots to connect directly to the CPU's PCIe bus for reduced latency. We built a cluster of 24 Dell C4130 servers, each running four NVIDIA Tesla K80 GPU cards, with interconnected memory access over Remote Direct Memory Access (RDMA) using Mellanox ConnectX-4 network cards and the InfiniBand interconnect. To further optimize shared GPU memory access we implemented NVIDIA's GPUDirect software. This integrated system cluster required us to make careful hardware selections.
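As a rough sanity check on the cluster configuration described above (24 Dell C4130 servers, each with four Tesla K80 cards in x16 PCIe slots), the following sketch tallies the aggregate resources. Note one fact not stated in the text but true of the product: each Tesla K80 board carries two GK210 GPU chips.

```python
# Sketch: aggregate resources of the Deep Learning cluster described above.
# Uses only figures from the text, plus the known fact that a Tesla K80
# board contains two GPU chips.
SERVERS = 24
CARDS_PER_SERVER = 4
GPUS_PER_K80_CARD = 2   # each K80 board has two GK210 chips
LANES_PER_CARD = 16     # each card occupies a full x16 slot

total_cards = SERVERS * CARDS_PER_SERVER            # 96 K80 cards
total_gpu_chips = total_cards * GPUS_PER_K80_CARD   # 192 GPU chips
lanes_per_server = CARDS_PER_SERVER * LANES_PER_CARD  # 64 lanes of GPU traffic

print(total_cards, total_gpu_chips, lanes_per_server)
```

The 64 lanes of GPU traffic per server illustrate why the C4130's PCIe switch matters: without it, a typical two-socket CPU complex cannot expose four directly attached x16 slots.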
In this workshop we will describe our experience with Deep Learning software and hardware objectives at Yahoo, as well as the selection and installation of hardware and software for research purposes for Yahoo engineering teams. We will cover the following topics as a method for evaluating Machine Learning objectives and designing the best computer hardware for the use case. We will also touch on the vendor discussions and research needed to test and integrate some of the latest GPU and network interfaces, to show the thought process behind the hardware selections. Topics include:

Deep Learning Framework Algorithm/Hardware Utilization

As a method of processing many representations of information, a Deep Learning system can have many different implementations. The Machine Learning community has created several frameworks to help standardize the software implementation and training of Deep Learning, or neural network, systems. One of the most widely used frameworks for image processing, which we will discuss here, is Caffe.

Deep Learning is an emerging topic in artificial intelligence. As a subcategory of machine learning, Deep Learning leverages neural networks to improve the processing of data for applications like image recognition, computer vision, audio recognition, and natural language processing. With myriad uses for artificial intelligence in enterprise and cloud applications, it is quickly becoming one of the most sought-after fields in computer science. But how did it evolve from an obscure academic topic into one of the industry's most exciting fields? Part of the answer is the surge in big-data use cases from companies like Yahoo, Facebook, Google, and even Walmart. The other part could simply be the renewed interest in machine learning spurred by new and more cost-efficient compute technologies using GPUs.

Current Hardware Options (CPU vs. GPU)

Graphics Processing Units (GPUs) are ideal for Deep Learning, shortening a process that could otherwise take a year or more to just weeks or days. That performance boost is mostly due to the architecture of GPU cores, which are designed to perform many calculations at once, in parallel. And once a system is trained with GPUs, scientists and researchers can put that learning to work on tasks that not long ago were thought impossible. Speech recognition is one application that has been flourishing, as has real-time voice translation from one language to another. Other researchers are building systems that analyze text using word definitions and patterns to determine the sentiment, or overall feel, of social media conversations.

Installation of the Hardware Platform

To build a GPU cluster at Yahoo, the first important thing is a fast network connection between the servers; using the Message Passing Interface (MPI) in your programming will make things much easier than relying on the options available in CUDA itself. Another consideration is power consumption: each GPU card (such as an NVIDIA Tesla K40) can consume 300 watts. For the Deep Learning project we used the Dell C4130 server, which supports only 1600 W of power with four GPUs (see Figure 8); running Caffe on these servers, power can hit 1800 W. In general, GPUs consume substantial power when loaded with compute-intensive workloads. The power consumption of the server configuration in the Deep Learning cluster is significantly higher (2.9x to 3.3x) than for CPU-only runs; this is due to the four K80 GPUs.

Installing the Software (Caffe, Intel MKL, ...)

Caffe is a Deep Learning framework created with expression, speed, and modularity in mind. It is developed by the Berkeley Vision and Learning Center (BVLC) and by community contributors, and is released under the BSD 2-Clause license.
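The power figures quoted in the installation discussion above can be sanity-checked with simple arithmetic, using only numbers from the text:

```python
# Sketch: power budget of one C4130 GPU node, using figures from the text.
GPU_POWER_W = 300        # per-GPU draw quoted in the text
GPUS_PER_SERVER = 4
SERVER_RATED_W = 1600    # rated server power quoted in the text
OBSERVED_PEAK_W = 1800   # observed draw running Caffe, per the text

gpu_power = GPU_POWER_W * GPUS_PER_SERVER    # 1200 W for the GPUs alone
headroom = SERVER_RATED_W - gpu_power        # 400 W left for CPUs, RAM, fans
overdraw = OBSERVED_PEAK_W - SERVER_RATED_W  # 200 W over the rated budget

print(gpu_power, headroom, overdraw)
```

The 200 W excursion above the 1600 W rating under Caffe load is consistent with the text's point that four fully loaded GPUs leave this configuration with very little power margin.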
Caffe's expressive architecture encourages application and innovation by design. It allows users to interface with their systems using models and optimizations that are defined by configuration, without hard-coding. One of the key features for testing and training is a simple switch between CPU and GPU processing: set a single flag to train on a GPU machine, then deploy to clusters of commodity servers or mobile devices.

At Yahoo, "Caffe" is a generic term; most of the time it means the software created at BVLC (see Section 3). ycaffe is Yahoo's internal version of the public BVLC Caffe. The Deep Learning team created the ycaffe package through the following steps:

1. Declare all the dependent libraries, such as the Intel Math Kernel Library, NVIDIA CUDA, etc., so that when a user runs yinst install ycaffe, those dependent libraries are automatically installed if not present.
2. Include additional Java and Scala code, so that we can run Caffe in the Yahoo Hadoop environment and launch it via Spark.
3. Add features to the original BVLC Caffe C++ code to meet our customers' requirements; for example, some sparse-matrix operations are supported in our version.
4. Add multi-GPU support.
5. More new features are expected in ycaffe.

GPUDirect and RDMA

NVIDIA GPUDirect RDMA is a technology that enables a direct path for data exchange between the GPU and third-party peer devices using standard features of PCI Express. Examples of third-party devices include network interfaces, video acquisition devices, storage adapters, and medical equipment. Enabled on Tesla- and Quadro-class GPUs, GPUDirect RDMA relies on the ability of NVIDIA GPUs to expose portions of device memory in a PCI Express Base Address Register region. Remote Direct Memory Access (RDMA) is direct memory access from the memory of one computer into that of another without involving either one's operating system. This permits high-throughput, low-latency networking, which is especially useful in massively parallel computer clusters.
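The single-flag CPU/GPU switch in Caffe mentioned above can be illustrated with a minimal solver definition. This is a sketch: the net path is hypothetical, but solver_mode and device_id are standard fields of the public BVLC Caffe solver prototxt format.

```
# Minimal solver.prototxt sketch. Flipping solver_mode between CPU and
# GPU is the single-flag switch described above.
net: "models/example/train_val.prototxt"  # hypothetical model path
base_lr: 0.01
max_iter: 10000
solver_mode: GPU   # set to CPU to run on commodity servers instead
device_id: 0       # which GPU to use when solver_mode is GPU
```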

How Does GPUDirect RDMA Work? Network Topology Using InfiniBand Technology

InfiniBand (IB) is a high-performance, low-latency interconnect commonly employed in High-Performance Computing (HPC). The IB standard specifies different link speed grades, such as QDR (40 Gb/s) and FDR (56 Gb/s). IB network interfaces support sophisticated hardware capabilities, like RDMA, and a variety of communication APIs, like IB Verbs [6], which is widely used in MPI implementations. For example, thanks to RDMA, an IB network interface can access the memory of remote nodes without any involvement of the remote CPU (one-sided communication).

GPUDirect RDMA is an API between the IB core and peer memory clients, such as NVIDIA Kepler-class GPUs. It gives the HCA access to read and write peer-memory data buffers, and as a result it allows RDMA-based applications to use the peer device's computing power over the RDMA interconnect without needing to copy data to host memory. This capability is supported by the Mellanox ConnectX-4 and Connect-IB InfiniBand adapters used in our study (see Figure 8). It also works seamlessly using RoCE technology with the Mellanox ConnectX-4 adapters. The key features are:

- Accelerated communication with network and storage devices: network and GPU device drivers can share pinned (page-locked) buffers, eliminating the need to make a redundant copy in CUDA host memory.
- Peer-to-peer transfers between GPUs: high-speed DMA transfers copy data between the memories of two GPUs on the same system/PCIe bus.
- Peer-to-peer memory access: communication between GPUs is optimized using NUMA-style access to memory on other GPUs from within CUDA kernels.
- RDMA: Remote Direct Memory Access transfers between GPUs and other PCIe devices eliminate CPU bandwidth and latency bottlenecks, resulting in significantly improved MPI send/receive efficiency between GPUs and other nodes.
- GPUDirect for Video: an optimized pipeline for frame-based devices such as frame grabbers, video switchers, HD-SDI capture, and CameraLink devices.

Performance Differences with and without GPUDirect

GPU-accelerated clusters and workstations are widely recognized for providing the tremendous horsepower required by compute-intensive workloads, and such applications can deliver even faster results with NVIDIA GPUDirect. Using GPUDirect, multiple GPUs, third-party network adapters (see Figure 6), solid-state drives (SSDs), and other devices can directly read and write CUDA host and device memory, eliminating unnecessary memory copies, dramatically lowering CPU overhead, and reducing latency, resulting in significantly improved data transfer times for applications running on NVIDIA Tesla products.

NVIDIA GPUDirect Peer-to-Peer (P2P) Communication Between GPUs

GPUDirect is a family of technologies that is continuously evolving to increase performance and expand usability. First introduced in June 2010, GPUDirect Shared Access supports accelerated communication with third-party PCI Express device drivers via shared pinned host memory (see Figure 7). In 2011, the release of GPUDirect Peer-to-Peer added support for transfers and direct load/store access between GPUs on the same PCI Express root complex. Announced in 2013, GPUDirect RDMA enables third-party PCI Express devices to access GPU memory directly, bypassing CPU host memory altogether.

Tests with Different DL Platforms (GoogLeNet, AlexNet, ...)

AlexNet trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images of the ImageNet LSVRC-2010 contest into 1000 different classes. On the test data, its authors achieved top-1 and top-5 error rates of 37.5% and 17.0%, considerably better than the previous state of the art.
The network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully connected layers with a final 1000-way softmax. To make training faster, the authors used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully connected layers, they employed a recently developed regularization method called dropout, which proved very effective. They also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% for the second-best entry.

For GoogLeNet, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing, with quality as the optimization target. One particular incarnation, called GoogLeNet, is a 22-layer deep network whose quality is assessed in the context of classification and detection. GoogLeNet uses 12x fewer parameters than the winning architecture while being significantly more accurate. The biggest gains in object detection have come not from deep networks or bigger models alone, but from the synergy of deep architectures and classical computer vision.

We introduce these two convolutional neural networks, AlexNet and GoogLeNet, because we are going to test them in our environment using the CPU and different GPU modes. AlexNet with dual GPUs achieves a 1.8x speedup, with quad GPUs 2.9x, and with 8 GPUs 3.4x. GoogLeNet with dual GPUs does better than AlexNet with dual GPUs, at 1.9x; quad GPUs give 3.2x, and 8 GPUs give 4.5x. We estimate a 6.6x speedup for 8 GPUs with a batch size of 128. We expect the results to be very similar using RDMA with GPU peering, since the transfer time is less than 10% of the compute time; we do not have final numbers yet. NVIDIA also has a couple of pull requests on my work that increase performance by 5 to 25% depending on the type of job; this will be part of the Flickr team's future work.

Target Audience: Machine Learning enthusiasts; software and hardware architects; developers interested in software/hardware interaction; Computer Science, Computer Engineering, and Software Engineering.
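As a closing note, the multi-GPU scaling numbers reported above can be restated as parallel efficiency (speedup divided by GPU count). This is simple arithmetic on the figures in the text:

```python
# Sketch: parallel efficiency of the reported AlexNet/GoogLeNet speedups.
# Keys are GPU counts; values are the speedups quoted in the text.
speedups = {
    "AlexNet":   {2: 1.8, 4: 2.9, 8: 3.4},
    "GoogLeNet": {2: 1.9, 4: 3.2, 8: 4.5},
}

for model, results in speedups.items():
    for gpus, speedup in sorted(results.items()):
        efficiency = speedup / gpus
        print(f"{model}: {gpus} GPUs -> {speedup}x speedup, "
              f"{efficiency:.0%} efficiency")
```

The falling efficiency at 8 GPUs for both networks is consistent with the PCIe and inter-node communication limits discussed earlier, and with the expectation that GPUDirect RDMA mainly helps by keeping transfer time a small fraction of compute time.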