Deep Learning GPU-Based Hardware Platform: Hardware and Software Criteria and Selection

Mourad Bouache
Yahoo! Performance Engineering Group, Sunnyvale, CA
+1.408.784.1446
bouache@yahoo-inc.com

John Glover
Yahoo! Performance Engineering Group, Sunnyvale, CA
+1.408.349.7511
glover@yahoo-inc.com

TUTORIAL: ICS-2016, Istanbul, Turkey, May 31st 2016

ABSTRACT

Deep Learning, as part of a wider set of machine learning methods, consists of layers of integrated computations and comparisons that draw correlations from many different representations of data, and it is inherently compute intensive. In stride with advances in CPU core count and speed, the time to process these data representations has decreased, and it is further reduced by parallel computing techniques. But as we scale out this parallel compute, we hit the limits of what commodity and even standard enterprise servers can support, to the point that adding more cores requires custom solutions for chipset support, CPU interconnectivity, and power consumption. This can be extremely cost intensive and, depending on your data sets and associated revenues, may not deliver the best value of cost to compute. Advances in compute technologies, in particular Graphics Processing Unit (GPU) technology, have contributed to a recent resurgence of interest in Deep Learning. There are many options in the GPU space, and the compute architecture of these cores is highly suited to floating-point number crunching as well as the matrix and vector mathematics involved in Deep Learning. In our work evaluating how different framework algorithms utilize compute hardware for Deep Learning, we tested several GPU-supporting servers from several vendors. This integrated system cluster required us to make careful hardware selections.
In this workshop, we, the Performance Engineering Group (PEG), will describe our experience with Deep Learning software and hardware objectives at Yahoo, as well as the selection and installation of hardware and software for research purposes for Yahoo engineering teams.

Presenters' Biographies

Mourad Bouache
Performance engineering and research; PhD in Computer Architecture. I have been doing performance engineering at Yahoo for almost 4 years. I develop benchmark code and profiling tools for code optimization and performance tuning, and I identify bottlenecks and optimal BIOS, software, and hardware configurations for different applications. My goal is to introduce new technologies based on identified needs: modern CPUs, GPUs, and Phi coprocessors for the intensive compute used in different Big Data applications at Yahoo. I work closely with software developers to improve code-base performance, reduce resource consumption, and shorten request latency. I develop tools to monitor billions of user requests, and I compare the compute resource offerings, in terms of workload, that Yahoo uses for cloud computing.
John Glover
Computer hardware research and performance engineer: server HW/SW configuration and installation, lab/network configuration, infrastructure and performance-test project management, and performance analysis; C/C++ software development and shell scripting. ~10 years of experience evaluating current and emerging computer hardware technologies, including Intel, ARM, and AMD's latest processor architectures, DDR memory architectures, RAID controllers, and storage components such as NVMe and Solid State Drives (SSDs), to support enterprise computer infrastructure. Evaluation includes performance testing and power-consumption management for application and storage servers to ensure system and network compatibility. I also create and maintain hardware vendor relationships to procure pre-production evaluation units for lab testing and application testing in our data centers.

Workshop Description

In our work evaluating how different framework algorithms utilize compute hardware for Deep Learning, we tested several GPU-supporting servers from several vendors. With the need for x16 data lanes per GPU connection, the PCIe bus became the limiting factor for scaling out cores. We selected the Dell C4130 for its optional PCIe switch, which allows up to four x16-lane slots connected directly to the CPU's PCIe bus for reduced latency. We built a cluster of 24 Dell C4130 servers, each running four NVIDIA Tesla K80 GPU cards, with interconnected memory access over Remote Direct Memory Access (RDMA) using Mellanox ConnectX-4 network cards and the InfiniBand interconnect. To further optimize shared GPU memory access, we deployed NVIDIA's GPUDirect software. This integrated system cluster required us to make careful hardware selections.
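To make the lane requirement concrete, here is a back-of-the-envelope sketch of the PCIe budget for a four-GPU node. The per-lane figure is the nominal PCIe 3.0 rate (8 GT/s with 128b/130b encoding), not a measurement from the C4130:

```python
# Illustrative PCIe lane/bandwidth budget for a node with four x16 GPUs.
# PCIe 3.0: 8 GT/s per lane, 128b/130b encoding -> ~0.985 GB/s per lane
# per direction (nominal, not measured).
PCIE3_GBPS_PER_LANE = 8 * (128 / 130) / 8  # GB/s per lane
LANES_PER_GPU = 16
GPUS_PER_NODE = 4

lanes_needed = LANES_PER_GPU * GPUS_PER_NODE
bw_per_gpu = PCIE3_GBPS_PER_LANE * LANES_PER_GPU

print(f"lanes needed per node: {lanes_needed}")        # 64 lanes
print(f"nominal bandwidth per GPU: {bw_per_gpu:.2f} GB/s")
```

Sixty-four x16 lanes per node is why slot topology, rather than raw core count, becomes the scaling constraint, and why a PCIe switch that keeps all four slots close to the CPU's root complex matters.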
In this workshop we will describe our experience with Deep Learning software and hardware objectives at Yahoo, as well as the selection and installation of hardware and software for research purposes for Yahoo engineering teams. We will cover the following topics as a method for evaluating Machine Learning objectives in order to design the best computer hardware for the use case. We will also touch on our vendor discussions and the research needed to test and integrate some of the latest GPU and network interfaces, to show the thought process behind the hardware selections. Topics include:

Deep Learning framework algorithm/hardware utilization

As a method of processing many representations of information, a Deep Learning system can have many different implementations. The Machine Learning community has created several frameworks to help standardize the software implementation and training of Deep Learning and neural network systems. One of the most widely used frameworks for image processing, which we discuss here, is Caffe. Deep Learning is an emerging topic in artificial intelligence. As a subcategory of machine learning, it leverages neural networks to improve the processing of data for applications such as image recognition, computer vision, audio recognition, and natural language processing. With myriad uses for artificial intelligence in enterprise and cloud applications, it is quickly becoming one of the most sought-after fields in computer science. But how did it evolve from an obscure academic topic into one of the industry's most exciting fields? Part of the answer is the surge in big-data use cases from companies like Yahoo, Facebook, Google, and even Walmart. The other part could simply be the renewed interest in machine learning spurred by new and more cost-efficient compute technologies using GPUs.
Current hardware options (CPU vs. GPU)

Graphics Processing Units (GPUs) are ideal for Deep Learning, shortening a process that could otherwise take a year or more to just weeks or days. That performance boost is mostly due to the architecture of GPU cores, which are designed to perform many calculations at once, in parallel. And once a system is trained, with GPUs, scientists and researchers can put that learning to work on tasks that not long ago were thought impossible. Speech recognition is one application that has been flourishing, as has real-time voice translation from one language to another. Other researchers are building systems that analyze text using word definitions and patterns to determine the sentiment, or overall feel, of social media conversations.

Installation of the hardware platform

To build a GPU cluster at Yahoo, the first important requirement is a fast network connection between the servers, and using the Message Passing Interface (MPI) in your programs makes things much easier than using the options available in CUDA itself. Another consideration is power consumption. Each GPU (NVIDIA Tesla K80) can consume 300 W of power, so for the Deep Learning project we used Dell C4130 servers supporting only 1600 W of power with four GPUs (see Figure 8); running Caffe on these servers, power can hit 1800 W. In general, GPUs consume substantial power when loaded with compute-intensive workloads. The power consumption of the server configuration in the Deep Learning cluster is significantly higher (2.9x to 3.3x) than CPU-only runs; this is due to the four K80 GPUs.

Installing the software (Caffe, Intel MKL, ...)

Caffe is a Deep Learning framework created with expression, speed, and modularity in mind. It is developed by the Berkeley Vision and Learning Center (BVLC) and by community contributors, and it is released under the BSD 2-Clause license.
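One of Caffe's conveniences is that CPU versus GPU execution is selected by a single solver setting rather than code changes. A minimal solver.prototxt sketch (the net path and hyperparameter values here are hypothetical, only the `solver_mode` line is the point):

```protobuf
# Hypothetical Caffe solver configuration; only solver_mode is essential.
net: "models/example/train_val.prototxt"   # hypothetical model path
base_lr: 0.01
max_iter: 100000
solver_mode: GPU   # flip to CPU to run the same model on CPU-only servers
```

Flipping that one flag is what lets a model be trained on a GPU machine and then deployed unchanged to clusters of commodity servers.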
Caffe's expressive architecture encourages application and innovation by design. It allows users to interface with their systems using models and optimizations defined by configuration, without hard-coding. One key feature for testing and training is the simple switch between CPU and GPU processing: set a single flag to train on a GPU machine, then deploy to clusters of commodity servers or mobile devices. At Yahoo, "Caffe" is a generic term; most of the time it means the software created at BVLC (see section 3). ycaffe is a Yahoo-internal version of the public BVLC Caffe. The Deep Learning team created the ycaffe package with the following steps:

1. Declare all the dependent libraries, such as the Intel Math Kernel Library, NVIDIA CUDA, etc., so that when a user runs yinst install ycaffe, those dependent libraries are installed automatically if not present.
2. Include additional Java and Scala code, so that we can run Caffe in the Yahoo Hadoop environment and launch it via Spark.
3. Add features to the original BVLC Caffe C++ code to meet our customers' requirements; for example, some sparse matrix operations are supported in our version.
4. Add multi-GPU support.
5. More new features are expected in ycaffe.

GPUDirect and RDMA

NVIDIA GPUDirect RDMA is a technology that enables a direct path for data exchange between the GPU and third-party peer devices using standard features of PCI Express. Examples of third-party devices include network interfaces, video acquisition devices, storage adapters, and medical equipment. Enabled on Tesla- and Quadro-class GPUs, GPUDirect RDMA relies on the ability of NVIDIA GPUs to expose portions of device memory in a PCI Express Base Address Register (BAR) region. Remote Direct Memory Access (RDMA) is a direct memory access from the memory of one computer into that of another without involving either one's operating system.
This permits high-throughput, low-latency networking, which is especially useful in massively parallel computer clusters.
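To make the data path concrete, here is a toy Python model (not real CUDA or IB Verbs code) that counts the host-memory staging steps a send-side GPU-to-remote-node transfer incurs with and without GPUDirect RDMA:

```python
# Toy model of the send-side data path for a GPU -> remote node transfer.
# Real implementations use CUDA and IB Verbs; this only enumerates the
# staging steps each path requires before data reaches the wire.

def host_copies(gpudirect_rdma):
    """Return the staging steps between GPU memory and the network HCA."""
    if gpudirect_rdma:
        # The HCA DMA-reads the GPU's exposed PCIe BAR directly:
        # no host-memory staging at all.
        return ["HCA DMA-reads GPU memory over PCIe"]
    # Classic path: device-to-host copy into a pinned bounce buffer,
    # then the HCA reads host memory.
    return [
        "cudaMemcpy device -> pinned host buffer",
        "HCA DMA-reads pinned host buffer",
    ]

baseline = host_copies(gpudirect_rdma=False)
direct = host_copies(gpudirect_rdma=True)
print(len(baseline) - len(direct), "staging step(s) eliminated")  # -> 1
```

Eliminating that bounce-buffer copy is exactly the saving GPUDirect RDMA provides: the transfer never touches host memory, which frees both CPU cycles and host-memory bandwidth.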
How does GPUDirect RDMA work?

Network topology using InfiniBand technology

InfiniBand (IB) is a high-performance, low-latency interconnection network commonly employed in High-Performance Computing (HPC). The IB standard specifies different link speed grades, such as QDR (40 Gb/s) and FDR (56 Gb/s). IB network interfaces support sophisticated hardware capabilities, such as RDMA, and a variety of communication APIs, such as IB Verbs [6], which is widely used in MPI implementations. For example, thanks to RDMA, an IB network interface can access the memory of remote nodes without any involvement of the remote CPU (one-sided communication).

GPUDirect RDMA is an API between the IB core and peer memory clients, such as NVIDIA Kepler-class GPUs. It provides access for the HCA to read/write peer memory data buffers, and as a result it allows RDMA-based applications to use the peer device's computing power over the RDMA interconnect without copying data to host memory. This capability is supported with the Mellanox ConnectX-4 or Connect-IB InfiniBand adapters used in our study (see Figure 8). It will also work seamlessly using RoCE technology with Mellanox ConnectX-4 adapters. The key features are:

Accelerated communication with network and storage devices: network and GPU device drivers can share pinned (page-locked) buffers, eliminating the need for a redundant copy in CUDA host memory.

Peer-to-peer transfers between GPUs: high-speed DMA transfers copy data between the memories of two GPUs on the same system/PCIe bus.

Peer-to-peer memory access: communication between GPUs is optimized using NUMA-style access to memory on other GPUs from within CUDA kernels.
RDMA: CPU bandwidth and latency bottlenecks are eliminated using Remote Direct Memory Access (RDMA) transfers between GPUs and other PCIe devices, resulting in significantly improved MPI send/receive efficiency between GPUs and other nodes.

GPUDirect for Video: an optimized pipeline for frame-based devices such as frame grabbers, video switchers, HD-SDI capture, and CameraLink devices.

Performance differences with and without GPUDirect

GPU-accelerated clusters and workstations are widely recognized for providing the tremendous horsepower required by compute-intensive workloads, and such applications can deliver results even faster with NVIDIA GPUDirect. Using GPUDirect, multiple GPUs, third-party network adapters (see Figure 6), solid-state drives (SSDs), and other devices can directly read and write CUDA host and device memory, eliminating unnecessary memory copies, dramatically lowering CPU overhead, and reducing latency. The result is significantly shorter data transfer times for applications running on NVIDIA Tesla products.

NVIDIA GPUDirect peer-to-peer (P2P) communication between GPUs

GPUDirect is a family of technologies that is continuously evolving to increase performance and expand usability. First introduced in June 2010, GPUDirect Shared Access supports accelerated communication with third-party PCI Express device drivers via shared pinned host memory (see Figure 7). In 2011, the release of GPUDirect Peer-to-Peer added support for transfers and direct load/store access between GPUs on the same PCI Express root complex. Announced in 2013, GPUDirect RDMA enables third-party PCI Express devices to access GPU memory directly, bypassing CPU host memory altogether.

Tests with different DL models (GoogLeNet, AlexNet, ...)

AlexNet trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images of the ImageNet LSVRC-2010 contest into 1000 different classes. On the test data, it achieved top-1 and top-5 error rates of 37.5% and 17.0%, considerably better than the previous state of the art.
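For reference, a top-k error counts a prediction as wrong when the true label is absent from the model's k highest-scoring classes. A minimal sketch of the metric (illustrative only, not the LSVRC evaluation code):

```python
def topk_error(scores, labels, k):
    """Fraction of samples whose true label is not among the top-k scores.

    scores: per-sample lists of per-class scores
    labels: true class index for each sample
    """
    wrong = 0
    for sample_scores, label in zip(scores, labels):
        # Indices of the k highest-scoring classes for this sample.
        topk = sorted(range(len(sample_scores)),
                      key=lambda i: sample_scores[i], reverse=True)[:k]
        if label not in topk:
            wrong += 1
    return wrong / len(labels)

# Tiny hypothetical example: 3 classes, 2 samples.
scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2]]
labels = [1, 2]  # the second sample's true class is ranked last
print(topk_error(scores, labels, k=1))  # -> 0.5
print(topk_error(scores, labels, k=3))  # -> 0.0
```

Top-5 error is the standard ImageNet figure because it forgives confusions among visually similar classes while still penalizing outright misses.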
The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully connected layers with a final 1000-way softmax. To make training faster, the authors used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully connected layers, they employed a recently developed regularization method called dropout, which proved very effective. They also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% for the second-best entry.

For GoogLeNet, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing, in order to optimize quality. One particular incarnation, called GoogLeNet, is a 22-layer deep network whose quality is assessed in the context of classification and detection. GoogLeNet uses 12x fewer parameters than the winning architecture while being significantly more accurate. The biggest gains in object detection have not come from deep networks or bigger models alone, but from the synergy of deep architectures and classical computer vision.

We introduce these two convolutional neural networks, AlexNet and GoogLeNet, because we test them in our environment using the CPU and different GPU modes. AlexNet achieves a 1.8x speedup with two GPUs, 2.9x with four GPUs, and 3.4x with eight GPUs. GoogLeNet scales better: 1.9x with two GPUs, 3.2x with four GPUs, and 4.5x with eight GPUs. With a batch size of 128, we project a 6.6x speedup for eight GPUs. We expect the results to be very similar using RDMA with GPU peering, since the transfer time is less than 10% of the compute time; we do not have final numbers yet. NVIDIA also has a couple of pull requests on our work that increase performance by 5% to 25% depending on the type of job; this will be part of the Flickr team's future work.

Target audience: Machine Learning enthusiasts; software and hardware architects; developers interested in software and hardware interaction; Computer Science, Computer Engineering, and Software Engineering.