NVIDIA Tesla K20-K20X GPU Accelerators Benchmarks Application Performance Technical Brief



Similar documents
HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK

Three Paths to Faster Simulations Using ANSYS Mechanical 16.0 and Intel Architecture

Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms

NVIDIA VIDEO ENCODER 5.0

Accelerating CFD using OpenFOAM with GPUs

High Performance Computing in CST STUDIO SUITE

OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC

NVIDIA GeForce GTX 750 Ti

High Performance Computing and Big Data: The coming wave.

Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga

Summit and Sierra Supercomputers:

NVIDIA CUDA GETTING STARTED GUIDE FOR MICROSOFT WINDOWS

GPU Hardware and Programming Models. Jeremy Appleyard, September 2015

L20: GPU Architecture and Models

STORAGE HIGH SPEED INTERCONNECTS HIGH PERFORMANCE COMPUTING VISUALISATION GPU COMPUTING

NVIDIA Jetson TK1 Development Kit

Simulation Platform Overview

Intel Solid-State Drives Increase Productivity of Product Design and Simulation

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming

Binomial option pricing model. Victor Podlozhnyuk

HP ProLiant SL270s Gen8 Server. Evaluation Report

NVIDIA Tesla Compute Cluster Driver for Windows

The Fastest, Most Efficient HPC Architecture Ever Built

NVIDIA CUDA GETTING STARTED GUIDE FOR MAC OS X

Case Study on Productivity and Performance of GPGPUs

GPU System Architecture. Alan Gray EPCC The University of Edinburgh

Accelerating Simulation & Analysis with Hybrid GPU Parallelization and Cloud Computing

PLGrid Infrastructure Solutions For Computational Chemistry

Introduction to GPU Programming Languages

Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1

TESLA C2050/2070 COMPUTING PROCESSOR INSTALLATION GUIDE

NVIDIA GeForce Experience

Technical Brief. DualNet with Teaming Advanced Networking. October 2006 TB _v02

Optimizing GPU-based application performance for the HP for the HP ProLiant SL390s G7 server

NVIDIA CUDA Software and GPU Parallel Computing Architecture. David B. Kirk, Chief Scientist

Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o

NVIDIA GeForce GTX 580 GPU Datasheet

NVIDIA CUDA GETTING STARTED GUIDE FOR MAC OS X

GPUs for Scientific Computing

Whitepaper. NVIDIA Miracast Wireless Display Architecture

Parallel Computing with MATLAB

The Value of High-Performance Computing for Simulation

TESLA K20X GPU ACCELERATOR

Robust Algorithms for Current Deposition and Dynamic Load-balancing in a GPU Particle-in-Cell Code

TESLA M2075 DUAL-SLOT COMPUTING PROCESSOR MODULE

TESLA M2050 AND TESLA M2070 DUAL-SLOT COMPUTING PROCESSOR MODULES

QUADRO POWER GUIDELINES

Technical Brief. MediaShield Storage Technology: Confidently Storing Your Digital Assets. March 2007 TB _v03

OpenCL Programming for the CUDA Architecture. Version 2.3

Introduction to GPU hardware and to CUDA

DUAL MONITOR DRIVER AND VBIOS UPDATE

ST810 Advanced Computing

INTEL PARALLEL STUDIO XE EVALUATION GUIDE

NVIDIA GRID DASSAULT CATIA V5/V6 SCALABILITY GUIDE. NVIDIA Performance Engineering Labs PerfEngDoc-SG-DSC01v1 March 2016

NVIDIA Tesla. GPU Computing Technical Brief. Version /24/07

High Performance. CAEA elearning Series. Jonathan G. Dudley, Ph.D. 06/09/ CAE Associates

HPC with Multicore and GPUs

Towards Fast SQL Query Processing in DB2 BLU Using GPUs A Technology Demonstration. Sina Meraji sinamera@ca.ibm.com

Monte-Carlo Option Pricing. Victor Podlozhnyuk

The PHI solution. Fujitsu Industry Ready Intel XEON-PHI based solution. SC Denver

GPU Acceleration of the SENSEI CFD Code Suite

Keys to node-level performance analysis and threading in HPC applications

Intel Platform and Big Data: Making big data work for you.

Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model

Parallel Prefix Sum (Scan) with CUDA. Mark Harris

A GPU COMPUTING PLATFORM (SAGA) AND A CFD CODE ON GPU FOR AEROSPACE APPLICATIONS

Intelligent Business Operations

The Hartree Centre helps businesses unlock the potential of HPC

Data Centric Systems (DCS)

Technical Brief. NVIDIA nview Display Management Software. May 2009 TB _v02

TESLA K20 GPU ACCELERATOR

E6895 Advanced Big Data Analytics Lecture 14:! NVIDIA GPU Examples and GPU on ios devices

Parallel Programming Survey

GPU Accelerated Signal Processing in OpenStack. John Paul Walters. Computer Scien5st, USC Informa5on Sciences Ins5tute

THE AMD MISSION 2 AN INTRODUCTION TO AMD NOVEMBER 2014

SGI HPC Systems Help Fuel Manufacturing Rebirth

NVIDIA CUDA INSTALLATION GUIDE FOR MICROSOFT WINDOWS

The Fastest Way to Parallel Programming for Multicore, Clusters, Supercomputers and the Cloud.

A quick tutorial on Intel's Xeon Phi Coprocessor

NVIDIA GRID K1 GRAPHICS BOARD

IBM Software Information Management Creating an Integrated, Optimized, and Secure Enterprise Data Platform:

Turbomachinery CFD on many-core platforms experiences and strategies

TESLA K10 GPU ACCELERATOR

Multicore Parallel Computing with OpenMP

Trends in High-Performance Computing for Power Grid Applications

CUDA in the Cloud Enabling HPC Workloads in OpenStack With special thanks to Andrew Younge (Indiana Univ.) and Massimo Bernaschi (IAC-CNR)

REMOTE VISUALIZATION ON SERVER-CLASS TESLA GPUS

Deep Learning GPU-Based Hardware Platform

Recent Advances in HPC for Structural Mechanics Simulations

LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR

NVIDIA GRID K2 GRAPHICS BOARD

Cluster Scalability of ANSYS FLUENT 12 for a Large Aerodynamics Case on the Darwin Supercomputer

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi

Data Centric Interactive Visualization of Very Large Data

Parallel Computing. Introduction

revolutionising high performance computing

Transcription:

NVIDIA Tesla K20-K20X GPU Accelerators Benchmarks Application Performance Technical Brief

NVIDIA changed the high performance computing (HPC) landscape by introducing its Fermibased GPUs that delivered an impressive leap in performance and energy-efficiency while offering a parallel programming model, CUDA, as an extension to industry standard languages like C, C++, and Fortran. Now NVIDIA has introduced the new Tesla K20-K20X GPU Accelerators that further advance the HPC industry. Built on the revolutionary Kepler architecture, these accelerators redefine the standard for energy-efficient computing and feature innovative technologies like SMX, Hyper-Q, and Dynamic Parallelism to boost application performance by up to 10x. SMX: 3x More Performance Per Watt The new SMX (Next Generation Streaming Multiprocessor) is an architectural innovation designed from the ground-up to deliver high efficiency performance. With SMX at its core, Tesla K20/K20X accelerators deliver the industry s highest single and double precision performance- 3.95 teraflops and 1.31 teraflops respectively for Tesla K20X- at an unprecedented 93% computational efficiency. Teraflop 1.5 1.0 0.5 0.0 Double Precision Performance (DGEMM) 1.22 0.43 0.17 Xeon E5-2687WTesla M2090 (Fermi) Tesla K20X Teraflop Single Precision Performance 3.0 (SGEMM) 2.0 2.9 1.0 0.89 0.0 0.35 Xeon E5-2687W Tesla M2090 Tesla K20X (Fermi) Hyper-Q: Easy Speed-up for Legacy MPI Codes Many legacy MPI codes don t generate enough work to fully occupy the GPU because they were written for CPU cores. Rather than requiring developers to refactor their codes to put more workload per MPI process, the Hyper-Q feature reduces efforts considerably because developers can now throw up to 32 MPI processes with small and medium-sized workloads at a shared GPU. Speedup vs. Dual K20 20x 15x 10x 5x CP2K-Quantum Chemistry K20 with Hyper-Q K20 without Hyper-Q 2.5x To illustrate the power of Hyper-Q, we picked a traditionally difficult code for GPUs called CP2K, a popular MPI-based quantum chemistry code, showing more than 2x performance improvement when 16 MPI ranks are run on the shared GPU. 0x 0 5 10 15 20 Number of GPUs

Dynamic Parallelism: Simplifying Parallel Programming Dynamic Parallelism allows the GPU to operate more autonomously from the CPU by generating new work for itself at runtime, from inside a kernel. The concept is simple, but the impact is powerful: it can make programming easier, particularly for algorithms traditionally considered difficult such as divide-and-conquer problems. Relative Sorting Performance Quicksort Tesla K20X Performance Advantage over Sandy Bridge CPUs 4x 3x 2x 1x 2x Without Dynamic Parallelism With Dynamic Parallelism 0x To showcase its potential, we used 0 5 10 Dynamic Parallelism on Quicksort, a well- Problem Size (Million of Elements) known algorithm for sorting methods, to reduce the lines of CUDA code in half while improving performance by 2x. K20X Benchmarks by Application Accelerating Key Scientific Applications by up to 10x Today, hundreds of applications take advantage of GPU acceleration, spanning all scientific disciplines and engineering domains, and the number of applications continues to grow. In the past year alone, the number of CUDA-accelerated applications has grown by over 60%. When Tesla K20 GPU Accelerators are added to servers with Sandy Bridge CPUs, CUDAenabled applications are typically accelerated up to 10x. The K20X benchmark below shows single node performance of leading applications in various science domains. CPU results: Dual socket E5-2687w; GPU results: Dual socket E5-2687w + 2 Tesla K20X GPUs *MATLAB results comparing one i7-2600k CPU vs. Tesla K20 GPU

Here s a closer look at three of the applications listed above. Chroma: High Energy & Nuclear Physics Chroma is used by scientists to test alternatives to the Standard Model of physics in order to develop a deeper understanding of the fundamental properties of matter and energy. The benchmark below show significant performance boost when adding one or two Tesla K20X GPU Accelerators to dual socket CPUs. 24 3 x128 Lattice Volume, BiCSTAB solver 20x 18.0x Relat iv e to 2x CP U 16x 14.6x 12x 8x 4x 0x 1.0x 7.5x 9.4x 2xCPU 2xCPU+ 2xCPU+ 2xCPU+ 2xCPU+ 1xM2090 2xM2090 1xK20X 2xK20X CPU results: Dual socket E5-2687w; GPU results: Dual socket E5-2687w + 2 Tesla K20X GPUs, 64GB per node SPECFEM3D: Earth Sciences Researchers use SPECFEM3D for a deeper understanding of the phenomena that shape seismic activity, thereby allowing engineers to assess potential hazards and to build structural countermeasures. This code was a Gordon Bell winner in 2008 prior to the GPU work, so it was a highly tuned code even before the recent work to accelerate on GPUs.

U ive elat R 16.0x x CP12.0x 8.0x 256x128 3D Spatial Discretization 5.4x 9.3x to 28.8x 4.0x 0.0x 1.0x 12.8x 2xCPU 2xCPU+ 2xCPU+ 2xCPU+ 2xCPU+ 1xM2090 2xM2090 1xK20X 2xK20X CPU results: Dual socket E5-2687w; GPU results: Dual socket E5-2687w + 2 Tesla K20X GPUs, 64GB per node AMBER: Molecular Dynamics Molecular dynamics (MD) allows the study of biological and chemical systems at the atomistic level on timescales from femtoseconds to milliseconds. Numerous software packages exist for conducting MD simulations of which one of the most widely used is AMBER. With the Tesla K20X GPU Accelerators, acceleration by up to 80% is observed on AMBER, over compared to Tesla M2090 GPUs. 10.0x SPFP-JAC_production_NVE Relative to 2x CPU 8.0x 6.0x 4.0x 2.0x 1.0x 3.4x 4.6x 7.1x 8.2x 0.0x 2xCPU 2xCPU+ 2xCPU+ 2xCPU+ 2xCPU+ 1xM2090 2xM2090 1xK20X 2xK20X CPU results: Dual socket E5-2687w; GPU results: Dual socket E5-2687w + 2 Tesla K20X GPUs, 64GB per node

Large Cluster Scaling for GPU-Accelerated Applications Large clusters around the world are increasingly deploying GPUs to accelerate their workload. In the Top500 list, from June 2011 to June 2012, number of GPU-accelerated systems grew by over 400%. Whether an application is GPU-accelerated or not, designing code to scale well across many nodes has never been an easy task. However, many GPU-accelerated applications are scaling well as developers work on extracting more parallelism by allocating more parallel work within a node and use MPI to distribute GPU workloads. WL-LSMS: Material Science WL-LSMS simulates the magnetic behavior of materials at the atomic and nanoscale in order to design lightweight, powerful magnetic components for highly efficient electric motors, generators, and magnetic storage devices. WL-LSMS was a Gordon Bell winner in 2009, so it was already a highly-tuned CPU code prior to the recent work to accelerate on GPUs. 40x 32 Atom Emsemble of Iron (Fe) Relative to a Single CPU only Node Cray XK7 -Tesla K20X 30x Cray XK7-CPU 20x 6 10x x 0x 0 20 40 60 80 100 120 140 160 180 200 # Sockets QMCPACK: Material Science QMCPACK is used by researchers to develop new insights into condensed matter physics, materials science, and chemistry by using more physically accurate methods. For large systems on the order of 200 electrons or more, simulations that traditionally would take upwards of 12 hours or more can now run on the order of 3 hours per simulation.

Compute Efficiency 1500000 Cray XK7 Tesla K20X Cray XK7 CPU 3x3x1 Graphite 1000000 4x 500000 0 0 500 1000 1500 2000 2500 # of Compute Nodes NAMD: Molecular Dynamics NAMD (Not (just) Another Molecular Dynamics program) is a free-of-charge molecular dynamics simulation package written using the Charm++ parallel programming model, noted for its parallel efficiency and often used to simulate large systems (millions of atoms). NAMD also scales well across many nodes accelerated with GPUs. ns/day 100x STMV 2.0 1.5 Cray XK7 K20X Cray XK7 CPU 1.0 4x 0.5 0.0 128 256 512 768 # of Compute Nodes For more information on Tesla K20/K20X GPU Accelerators, visit www.nvidia.com/tesla.

Notice ALL INFORMATION PROVIDED IN THIS TECHNICAL BRIEF, INCLUDING COMMENTARY, OPINION, NIVIDA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, MATERIALS ) ARE BEING PROVIDED AS IS. NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMEN, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. Information furnished is believed to be accurate and reliable. However, NVIDIA Corporation assumes no responsibility for the consequences of use of such information or for any infringement of patents or other rights of third parties that may result from its use. No license is granted by implication of otherwise under any patent rights of NVIDIA Corporation. This publication supersedes and replaces all other information previously supplied. NVIDIA Corporation products are not authorized as critical components in life support devices or system without express written approval of NVIDIA Corporation. Copyright 2012 NVIDIA Corporation. All rights reserved. NVIDIA, the NVIDIA logo, Tesla, Kepler, and CUDA are trademarks and / or registered trademarks of NVIDIA Corporation. All company and product names are trademarks or registered trademarks of the respective owners with which they are associated. Features, pricing, availability, and specifications are all subject to change without notice. NOV 2012 www.nvidia.com