Parallel Programming Survey


Christian Terboven <terboven@itc.rwth-aachen.de>
02.09.2014 / Aachen, Germany
As of: 26.08.2014, Version 2.3
IT Center, RWTH Aachen University

Agenda
- Overview: Processor Microarchitecture
- Shared-Memory Parallel Systems
- Distributed-Memory Parallel Systems (Clusters)
- General Purpose Graphic Processing Units (GPGPUs)
- Summary and Conclusions

Overview: Processor Microarchitecture

Performance Metrics
FLOPS = Floating-Point Operations per Second
- Megaflops = 10^6 FLOPS
- Gigaflops = 10^9 FLOPS
- Teraflops = 10^12 FLOPS
- Petaflops = 10^15 FLOPS
Just one number: Intel Westmere-EP at 3.06 GHz: 72 GFLOPS
Memory Bandwidth: rate at which data can be read from or stored into a semiconductor memory by a processor.
Just one number: Intel Westmere-EP at 3.06 GHz: about 18 GB/s (1 socket)
Memory Latency: delay incurred when a processor accesses data inside the main memory (data not in the cache).
Just one number: Intel Westmere-EP at 3.06 GHz: about 18 ns (1 thread)

Single Processor System
Processor (core):
- Fetch program from memory
- Execute program instructions
- Load data from memory
- Process data
- Write results back to memory
Main Memory:
- Store program
- Store data
Input/output is not covered here!

Memory Bottleneck
There is a growing gap between core and memory performance:
- Memory, since 1980: 1.07x per year improvement in latency
- Single core, since 1980: 1.25x per year until 1986, 1.52x per year until 2000, 1.20x per year until 2005, then no change on a per-core basis
Source: John L. Hennessy, Stanford University, and David A. Patterson, University of California, September 25, 2012

Single System w/ Caches
- CPU is fast: order of 3.0 GHz
- Caches (on-chip and off-chip): fast, but expensive; thus small, order of MB
- Memory is slow: order of 20 ns latency; large, order of GB
A good utilization of caches is crucial for good performance of HPC applications!

Moore's Law still holds (for a while)! The number of transistors on a chip is still increasing, but no longer the clock speed! Instead, we see many cores per chip.

Chip Multi-Threading (CMT)
Traditional single-core processors can only process one thread at a time, spending the majority of their time waiting for data from memory.
CMT refers to a processor's ability to process multiple software threads. Such capabilities can be implemented using a variety of methods, such as:
- Having multiple cores on a single chip: Chip Multi-Processing (CMP)
- Executing multiple threads on a single core: Simultaneous Multi-Threading (SMT)
- A combination of both CMP and SMT

Simultaneous Multi-Threading (SMT)
- Each core executes multiple threads simultaneously
- Typically there is one register set per thread
- But the compute units are shared

Today: Multiple Multi-Threaded Cores
Combination of CMP and SMT at work:

Shared-Memory Parallelism

Example of an SMP System
Dual-socket Intel Woodcrest (dual-core) system:
- Two cores per chip, 3.0 GHz
- Each chip has 4 MB of L2 cache on-chip, shared by both cores
- No off-chip cache
- Bus: Frontside Bus
SMP: Symmetric Multi-Processor
- Memory access time is uniform for all cores
- Limited scalability

Example of a cc-NUMA System
Dual-socket AMD Opteron (dual-core) system:
- Two cores per chip, 2.4 GHz
- Each core has a separate 1 MB of L2 cache on-chip
- No off-chip cache
- Interconnect: HyperTransport
cc-NUMA: Memory access time is non-uniform
- Scalable (only if you do it right, as we will see)

Non-uniform Memory
Serial code: all array elements are allocated in the memory of the NUMA node containing the core executing this thread, so the whole range A[0] ... A[N] ends up on a single node:

double* A;
A = (double*) malloc(N * sizeof(double));

for (int i = 0; i < N; i++) {
    A[i] = 0.0;
}

First Touch Memory Placement
First touch with parallel code: all array elements are allocated in the memory of the NUMA node containing the core executing the thread that initializes the respective partition, so A[0] ... A[N/2] lands on the first node and A[N/2] ... A[N] on the second:

double* A;
A = (double*) malloc(N * sizeof(double));
omp_set_num_threads(2);

#pragma omp parallel for
for (int i = 0; i < N; i++) {
    A[i] = 0.0;
}

Serial vs. Parallel Initialization
Performance of an OpenMP-parallel STREAM vector assignment, measured on a 2-socket Intel Xeon X5675 ("Westmere") using the Intel Composer XE 2013 compiler with different thread binding options:

Shared Memory Parallelization
Memory can be accessed by several threads running on different cores in a multi-socket multi-core system (e.g., CPU 1 executes a = 4 while CPU 2 reads a to compute c = 3 + a).
Programmable with OpenMP.
Approach: look for tasks that can be executed simultaneously (task parallelism).

Distributed-Memory Parallelism

Example of a Cluster of SMP Nodes
The second-level interconnect (network) is not cache-coherent.
Typically used in High Performance Computing: InfiniBand
- Latency: <= 5 us
- Bandwidth: >= 1200 MB/s
Also used: Gigabit Ethernet
- Latency: <= 60 us
- Bandwidth: >= 100 MB/s
Latency: time required to send a message of size zero (that is, the time to set up the communication).
Bandwidth: rate at which large messages (>= 2 MB) are transferred.

Distributed Memory Parallelization
Each process has its own distinct memory.
Communication via message passing: CPU 1 sends a, the network transfers a, CPU 2 receives a into its local memory.
Programmable with MPI.
Approach: decompose the data into distinct chunks to be processed independently (data parallelism).
Example: ventricular assist device (VAD)

Our HPC Cluster from Bull: Overview
Admin and login nodes (8/2), InfiniBand fabric / GigE fabric:
- 75x Bullx B chassis: 1350 MPI nodes
- 171x Bullx S/8: 171 SMP nodes (partly coupled via ScaleMP)
- HPC filesystem: 1.5 PB; home filesystem: 1.5 PB
Source: Bull

Group        | # Nodes   | Sum TFLOPS | Sum Memory | Processors per Node                        | Memory per Node
MPI-Small    | 1098      | 161        | 26 TB      | 2x Westmere-EP (3.06 GHz, 6 cores, 12 threads) | 24 GB
MPI-Large    | 252       | 37         | 24 TB      | 2x Westmere-EP (3.06 GHz, 6 cores, 12 threads) | 96 GB
SMP-Small    | 135       | 69         | 17 TB      | 4x Nehalem-EX (2.0 GHz, 8 cores, 16 threads)   | 128 GB
SMP-Large    | 36        | 18         | 18 TB      | 4x Nehalem-EX (2.0 GHz, 8 cores, 16 threads)   | 512 GB
ScaleMP-vSMP | 8 coupled | 4          | 4 TB       | BCS: 4, vSMP: 16                               | 4 TB coupled

General Purpose Graphic Processing Units (GPGPUs)

GPGPUs
GPGPUs = General Purpose Graphics Processing Units
- From a fixed-function graphics pipeline to programmable processors for general-purpose computations
- Programming paradigms: CUDA, OpenCL, OpenACC, OpenMP 4.0, ...
- Main vendors: NVIDIA (e.g. Quadro, Tesla, Fermi, Kepler), AMD (e.g. FireStream, Radeon)
- Manycore architecture: 8 cores on a CPU vs. 2880 cores on a GPU

Comparison CPU vs. GPU
Different design goals (NVIDIA Corporation, 2010):
CPU:
- Optimized for low-latency access to cached data sets
- Control logic for out-of-order and speculative execution
GPU:
- Optimized for data-parallel, throughput computation
- Architecture tolerant of memory latency
- More transistors dedicated to computation

GPGPU Execution Model
- Thread -> core: each thread is executed by a core
- Block -> multiprocessor: each block is executed on a multiprocessor; several concurrent blocks can reside on one MP, depending on memory requirements/resources
- Grid (kernel) -> device: each kernel is executed on the device

GPGPU Programming: Kernel and Threads
- The parallel portion of the application is executed as a kernel on the device
- A kernel is executed as an array of threads; all threads execute the same code
- GPU threads: thousands run simultaneously; lightweight, with little creation overhead; fast switching
- Threads are grouped into blocks; blocks are grouped into a grid (1D or 2D)
- Programmable with CUDA / OpenACC

Summary and Conclusions

Summary and Conclusion
- With a growing number of cores per system, even the desktop or notebook has become a parallel computer. Only parallel programs will profit from new processors!
- SMP boxes are the building blocks of large parallel systems.
- Memory hierarchies will grow further.
- The CMP + SMT architecture is very promising for large, memory-bound problems.
- Program optimization strategies: exploit the memory hierarchy.
[Figure: cores per socket over time. 2004-2007: 2-8 cores (Sun, Intel, AMD, IBM; Intel's 80-core chip was experimental); 200x: 8/16 cores, multi-core; 201x: O(100)-O(1000) cores, many-core, with tree-like networks.]

The End
Thank you for your attention.