HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK

Similar documents

Summit and Sierra Supercomputers:

NVIDIA GPUs in the Cloud

OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC

GPU System Architecture. Alan Gray EPCC The University of Edinburgh

GPU Hardware and Programming Models. Jeremy Appleyard, September 2015

NVLink High-Speed Interconnect: Application Performance

Evoluzione dell Infrastruttura di Calcolo e Data Analytics per la ricerca

High Performance Computing in CST STUDIO SUITE

The GPU Accelerated Data Center. Marc Hamilton, August 27, 2015

Accelerating CFD using OpenFOAM with GPUs

NVIDIA HPC Update. Carl Ponder, PhD; NVIDIA, Austin, TX, USA - Sr. Applications Engineer, Developer Technology Group

How To Build An Ark Processor With An Nvidia Gpu And An African Processor

PCIe Over Cable Provides Greater Performance for Less Cost for High Performance Computing (HPC) Clusters. from One Stop Systems (OSS)

High Performance. CAEA elearning Series. Jonathan G. Dudley, Ph.D. 06/09/ CAE Associates

Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o

Building a Top500-class Supercomputing Cluster at LNS-BUAP

Appro Supercomputer Solutions Best Practices Appro 2012 Deployment Successes. Anthony Kenisky, VP of North America Sales

Interconnect Your Future Enabling the Best Datacenter Return on Investment. TOP500 Supercomputers, June 2016

HP ProLiant SL270s Gen8 Server. Evaluation Report

InfiniBand Strengthens Leadership as the High-Speed Interconnect Of Choice

COMP/CS 605: Intro to Parallel Computing Lecture 01: Parallel Computing Overview (Part 1)

The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System

Purchase of High Performance Computing (HPC) Central Compute Resources by Northwestern Researchers

Case Study on Productivity and Performance of GPGPUs

Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011

Computer Graphics Hardware An Overview

Jean-Pierre Panziera Teratec 2011

Intel Xeon Processor E5-2600

Hadoop on the Gordon Data Intensive Cluster

Accelerating Simulation & Analysis with Hybrid GPU Parallelization and Cloud Computing

Cloud Data Center Acceleration 2015

Seeking Opportunities for Hardware Acceleration in Big Data Analytics

FAST DATA = BIG DATA + GPU. Carlo Nardone, Senior Solution Architect EMEA Enterprise UNITO, March 21 th, 2016

Optimizing GPU-based application performance for the HP for the HP ProLiant SL390s G7 server

STORAGE HIGH SPEED INTERCONNECTS HIGH PERFORMANCE COMPUTING VISUALISATION GPU COMPUTING

Emerging storage and HPC technologies to accelerate big data analytics Jerome Gaysse JG Consulting

Trends in High-Performance Computing for Power Grid Applications

Cluster Implementation and Management; Scheduling

How To Compare Amazon Ec2 To A Supercomputer For Scientific Applications

Recent Advances in HPC for Structural Mechanics Simulations

Crossing the Performance Chasm with OpenPOWER

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi

ECLIPSE Best Practices Performance, Productivity, Efficiency. March 2009

CUDA in the Cloud Enabling HPC Workloads in OpenStack With special thanks to Andrew Younge (Indiana Univ.) and Massimo Bernaschi (IAC-CNR)

CFD Implementation with In-Socket FPGA Accelerators

A GPU COMPUTING PLATFORM (SAGA) AND A CFD CODE ON GPU FOR AEROSPACE APPLICATIONS

Findings in High-Speed OrthoMosaic

Mississippi State University High Performance Computing Collaboratory Brief Overview. Trey Breckenridge Director, HPC

InfiniBand Update Addressing new I/O challenges in HPC, Cloud, and Web 2.0 infrastructures. Brian Sparks IBTA Marketing Working Group Co-Chair

CORRIGENDUM TO TENDER FOR HIGH PERFORMANCE SERVER

Parallel Programming Survey

Pedraforca: ARM + GPU prototype

#OpenPOWERSummit. Join the conversation at #OpenPOWERSummit 1

LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance

ECDF Infrastructure Refresh - Requirements Consultation Document

Stovepipes to Clouds. Rick Reid Principal Engineer SGI Federal by SGI Federal. Published by The Aerospace Corporation with permission.

GPGPU accelerated Computational Fluid Dynamics

Data Centric Systems (DCS)

ECLIPSE Performance Benchmarks and Profiling. January 2009

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

New Dimensions in Configurable Computing at runtime simultaneously allows Big Data and fine Grain HPC

Build an Energy Efficient Supercomputer from Items You can Find in Your Home (Sort of)!

LS DYNA Performance Benchmarks and Profiling. January 2009

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

The L-CSC cluster: Optimizing power efficiency to become the greenest supercomputer in the world in the Green500 list of November 2014

Kriterien für ein PetaFlop System

The Foundation for Better Business Intelligence

Energy efficient computing on Embedded and Mobile devices. Nikola Rajovic, Nikola Puzovic, Lluis Vilanova, Carlos Villavieja, Alex Ramirez

Overview of HPC Resources at Vanderbilt

RAID. RAID 0 No redundancy ( AID?) Just stripe data over multiple disks But it does improve performance. Chapter 6 Storage and Other I/O Topics 29

Infrastructure Matters: POWER8 vs. Xeon x86

Visit to the National University for Defense Technology Changsha, China. Jack Dongarra. University of Tennessee. Oak Ridge National Laboratory

High-Performance Computing and Big Data Challenge

Enabling High performance Big Data platform with RDMA

Deep Learning GPU-Based Hardware Platform

Parallel Computing. Introduction

Copyright 2013, Oracle and/or its affiliates. All rights reserved.

Overview: X5 Generation Database Machines

Cabot Partners. 3D Virtual Desktops that Perform. Cabot Partners Group, Inc. 100 Woodcrest Lane, Danbury CT 06810,

Interconnect Your Future Enabling the Best Datacenter Return on Investment. TOP500 Supercomputers, November 2015

Linux Cluster Computing An Administrator s Perspective

ALPS Supercomputing System A Scalable Supercomputer with Flexible Services

Innovativste XEON Prozessortechnik für Cisco UCS

Supercomputing Clusters with RapidIO Interconnect Fabric

Parallel Computing with MATLAB

FLOW-3D Performance Benchmark and Profiling. September 2012

Introduction to High Performance Cluster Computing. Cluster Training for UCL Part 1

IBM Platform Computing Cloud Service Ready to use Platform LSF & Symphony clusters in the SoftLayer cloud

Towards Fast SQL Query Processing in DB2 BLU Using GPUs A Technology Demonstration. Sina Meraji sinamera@ca.ibm.com

præsentation oktober 2011

Optimizing a 3D-FWT code in a cluster of CPUs+GPUs

Supercomputing Resources in BSC, RES and PRACE

FPGA Acceleration using OpenCL & PCIe Accelerators MEW 25

GPU Accelerated Signal Processing in OpenStack. John Paul Walters. Computer Scien5st, USC Informa5on Sciences Ins5tute

Next Generation GPU Architecture Code-named Fermi

Scaling from Workstation to Cluster for Compute-Intensive Applications

10- High Performance Compu5ng

Data Center and Cloud Computing Market Landscape and Challenges

Build GPU Cluster Hardware for Efficiently Accelerating CNN Training. YIN Jianxiong Nanyang Technological University

Transcription:

HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK Steve Oberlin CTO, Accelerated Computing

US to Build Two Flagship Supercomputers SUMMIT SIERRA Partnership for Science 100-300 PFLOPS Peak Performance 10x in Scientific Applications 2017 Major Step Forward on the Path to Exascale 2

Just 4 nodes in Summit would make the Top500 list of supercomputers today Similar Power as Titan 5-10x Faster 1/5th the Size 150 PF = 3M Laptops One laptop for Every Resident in State of Mississippi 3

Optimizing Serial/Parallel Execution Application Code Parallel Work Serial Work GPU Majority of Ops System and Sequential Ops CPU + 4

NVLink-Enabled Heterogeneous Node 5x Higher Energy Efficiency 80-200 GB/s IBM POWER CPU Most Powerful Serial Processor NVIDIA NVLink Fastest CPU-GPU Interconnect NVIDIA Volta GPU Most Powerful Parallel Processor 5

NVLink: Logical Node Integration TESLA GPU NVLink 80 GB/s Power or ARM CPU 5x PCIe bandwidth Move data at CPU memory speed 3x lower energy/bit HBM 1 Terabyte/s Stacked Memory Throughput Optimized DDR4 50-75 GB/s DDR Memory Latency Optimized 6

KEPLER GPU PASCAL GPU NVLink NVLink High-speed GPU Interconnect POWER CPU NVLink PCIe PCIe X86 ARM64 POWER CPU 2014 X86 ARM64 POWER CPU 2016 7

NVLink Unleashes Multi-GPU Performance GPUs Interconnected with NVLink CPU Over 2x Application Performance Speedup When Next-Gen GPUs Connect via NVLink Versus PCIe Speedup vs PCIe based Server 2.25x 2.00x PCIe Switch 1.75x TESLA GPU TESLA GPU 1.50x 1.25x 5x Faster than PCIe Gen3 x16 1.00x ANSYS Fluent Multi-GPU Sort LQCD QUDA AMBER 3D FFT 3D FFT, ANSYS: 2 GPU configuration, All other apps comparing 4 GPU 8 configuration 8 AMBER Cellulose (256x128x128), FFT problem size (256^3)

Two Computing Models For Accelerators Many-Weak-Cores (MWC) Model Single CPU Core for Both Serial & Parallel Work Heterogeneous Computing Model Complementary Processors Work Together Xeon Phi (And Others) Many Weak Serial Cores CPU Optimized for Serial Tasks GPU Accelerator Optimized for Parallel Tasks 9

Amdahl s Law Analysis 98% Parallel Work 1 GPU+1 CPU 2 x MWC (.25x CPU) Serial Parallel Work 1x CPU 0 1 2 3 4 5 6 7 8 9 10 Minutes Run Time 10

Amdahl s Law Analysis 90% Parallel Work 1 GPU+1 CPU 2 x MWC (.25x CPU) Serial Parallel Work 1x CPU 0 1 2 3 4 5 6 7 8 9 10 Minutes Run Time 11

Amdahl s Law Analysis 80% Parallel Work 1 GPU+1 CPU 2 x MWC (.25x CPU) Serial Parallel Work 1x CPU 0 1 2 3 4 5 6 7 8 9 10 Minutes Run Time 12

Amdahl s Law Analysis 70% Parallel Work 1 GPU+1 CPU 2 x MWC (.25x CPU) Serial Parallel Work 1x CPU 0 2 4 6 8 10 12 14 Minutes Run Time 13

Amdahl s Law Analysis 60% Parallel Work 1 GPU+1 CPU 2 x MWC (.25x CPU) Serial Parallel Work 1x CPU 0 2 4 6 8 10 12 14 16 18 Minutes Run Time 14

Amdahl s Law Analysis 50% Parallel Work 1 GPU+1 CPU 2 x MWC (.25x CPU) Serial Parallel Work 1x CPU 0 5 10 15 20 25 Minutes Run Time 15

Amdahl s Law Analysis 40% Parallel Work 1 GPU+1 CPU 2 x MWC (.25x CPU) Serial Parallel Work 1x CPU 0 5 10 15 20 25 30 Minutes Run Time 16

Amdahl s Law Analysis 30% Parallel Work 1 GPU+1 CPU 2 x MWC (.25x CPU) Serial Parallel Work 1x CPU 0 5 10 15 20 25 30 Minutes Run Time 17

Amdahl s Law Analysis 20% Parallel Work 1 GPU+1 CPU 2 x MWC (.25x CPU) Serial Parallel Work 1x CPU 0 5 10 15 20 25 30 35 Minutes Run Time 18

Amdahl s Law Analysis 10% Parallel Work 1 GPU+1 CPU 2 x MWC (.25x CPU) Serial Parallel Work 1x CPU 0 5 10 15 20 25 30 35 40 Minutes Run Time 19

TESLA K80 WORLD S FASTEST ACCELERATOR FOR DATA ANALYTICS AND SCIENTIFIC COMPUTING Dual-GPU Accelerator for Max Throughput 2x Faster 2.9 TF 4992 Cores 480 GB/s 25x 20x 15x Deep Learning: Caffe Double the Memory Designed for Big Data Apps 24GB K40 12GB Maximum Performance Dynamically Maximize Performance for Every Application 10x 5x 0x CPU Tesla K40 Tesla K80 Oil & HPC Gas Viz Data Analytics Caffe Benchmark: AlexNet training throughput based on 20 iterations, CPU: E5-2697v2 @ 2.70GHz. 64GB System Memory, CentOS 6.2 20

Performance Lead Continues to Grow Peak Double Precision FLOPS Peak Memory Bandwidth GFLOPS 3500 GB/s 600 3000 K80 500 K80 2500 400 2000 K40 300 K40 1500 1000 500 M1060 K20 M2090 Ivy Bridge Sandy Bridge Westmere Haswell 200 100 M1060 K20 M2090 Sandy Bridge Westmere Haswell Ivy Bridge 0 2008 2009 2010 2011 2012 2013 2014 0 2008 2009 2010 2011 2012 2013 2014 NVIDIA GPU x86 CPU NVIDIA GPU x86 CPU 21

15x 10x Faster than CPU on Applications K80 CPU 10x 5x 0x Benchmarks Molecular Dynamics Quantum Chemistry Physics CPU: 12 cores, E5-2697v2 @ 2.70GHz. 64GB System Memory, CentOS 6.2 GPU: Single Tesla K80, Boost enabled 22

Tesla Platform Enables Optimization Scalable Nodes, ISA Choice NVLink x86 23

Tesla Platform Enables Optimization Ecosystem Industry Standard CPUs and Interconnects ARM64 POWER x86 Ethernet Cray NVIDIA GPU InfiniBand Others Industry-Driven Solutions 24

CORAL Scalable Heterogeneous Node NVLink In Practice Approximately 3,400 nodes, each with: IBM POWER9 CPUs and multiple NVIDIA Tesla Volta GPUs CPUs and GPUs integrated on-node with high speed NVLink Large coherent memory: over 512 GB (HBM + DDR4) All directly addressable from the CPUs and GPUs An additional 800 GB of NVRAM, burst buffer or as extended memory Over 40 TF peak performance/node(!) 25

Optimized Heterogeneous Node CORAL Application Performance Projections 26

YOUR PLATFORM FOR DISCOVERY Tesla Accelerated Computing