Overview of High Performance Computing




Overview of High Performance Computing. Timothy H. Kaiser, Ph.D. tkaiser@mines.edu http://geco.mines.edu/workshop 1

This tutorial will cover all three time slots. In the first session we will discuss the importance of parallel computing to high performance computing. We will show, by example, the basic concepts of parallel computing and discuss its advantages and disadvantages. We will also present an overview of current and future trends in HPC hardware. The second session will provide an introduction to MPI, the most common package used to write parallel programs for HPC platforms. As tradition dictates, we will show how to write "Hello World" in MPI. Attendees will be shown how to build and run relatively simple examples on a consortium resource. The third session will briefly discuss other important HPC topics, including OpenMP and hybrid programming that combines MPI and OpenMP. Some computational libraries available for HPC will be highlighted, and we will briefly mention parallel computing using graphics processing units (GPUs). 2

Today's Overview: HPC computing in a nutshell; basic MPI - run an example; a few additional MPI features; a real MPI example; scripting; OpenMP; libraries and other stuff. 3

Introduction: What is parallel computing? Why go parallel? When do you go parallel? What are some limits of parallel computing? Types of parallel computers. Some terminology. 4

What is Parallelism? Consider your favorite computational application. One processor can give me results in N hours. Why not use N processors -- and get the results in just one hour? The concept is simple: Parallelism = applying multiple processors to a single problem. 5

Parallel computing is computing by committee. Parallel computing: the use of multiple computers or processors working together on a common task. Each processor works on its section of the problem, and processors are allowed to exchange information with other processors. [Figure: grid of a problem to be solved, divided into four regions, with processes 0 through 3 each doing the work for one region] 6

Why do parallel computing? Limits of single CPU computing: available memory and performance. Parallel computing allows us to solve problems that don't fit on a single CPU and problems that can't be solved in a reasonable time. 7

Why do parallel computing? We can run larger problems, run faster, run more cases, run simulations at finer resolutions, and model physical phenomena more realistically. 8

Weather Forecasting: The atmosphere is modeled by dividing it into three-dimensional regions or cells, 1 mile x 1 mile x 1 mile (10 cells high), about 500 x 10^6 cells. The calculations for each cell are repeated many times to model the passage of time. About 200 floating point operations per cell per time step, or 10^11 floating point operations per time step. A 10 day forecast with 10 minute resolution => 1.5x10^14 flop. At 100 Mflops this would take about 17 days, at 1.7 Tflops about 2 minutes, and at 17 Tflops about 12 seconds. 9
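
A quick back-of-the-envelope sketch of the arithmetic above (the cell count, operations per cell, and machine speeds are the figures quoted on the slide):

# Rough estimate of the work in the weather forecasting example.
cells = 500e6            # ~500 x 10^6 cells (1 mile x 1 mile x 1 mile, 10 cells high)
flop_per_cell = 200      # floating point operations per cell per time step
flop_per_step = cells * flop_per_cell          # ~1e11 flop per time step

steps = 10 * 24 * 6      # 10 day forecast, one step every 10 minutes
total_flop = flop_per_step * steps             # ~1.5e14 flop

for name, rate in [("100 Mflops", 100e6), ("1.7 Tflops", 1.7e12), ("17 Tflops", 17e12)]:
    seconds = total_flop / rate
    print("%-10s -> %10.0f seconds (%.1f days)" % (name, seconds, seconds / 86400.0))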

Modeling Motion of Astronomical Bodies (brute force): Each body is attracted to each other body by gravitational forces. The movement of each body can be predicted by calculating the total force experienced by the body. For N bodies, N - 1 forces per body yields N^2 calculations each time step. A galaxy has about 10^11 stars => 10^9 years for one iteration. Using an efficient N log N approximate algorithm => about a year. NOTE: This is closely related to another hot topic: Protein Folding. 10
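
A minimal sketch of the brute-force force calculation described above (plain Python, with made-up masses and positions purely to show where the N^2 cost comes from; a real code would use an N log N tree or fast multipole method):

from math import sqrt

G = 6.674e-11  # gravitational constant

def total_forces(masses, positions):
    """Accumulate the force on every body from every other body: O(N^2) work."""
    n = len(masses)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            dx = positions[j][0] - positions[i][0]
            dy = positions[j][1] - positions[i][1]
            dz = positions[j][2] - positions[i][2]
            r = sqrt(dx * dx + dy * dy + dz * dz)
            f = G * masses[i] * masses[j] / (r * r)
            forces[i][0] += f * dx / r
            forces[i][1] += f * dy / r
            forces[i][2] += f * dz / r
    return forces

masses = [1.0e30, 2.0e30, 1.5e30]                                  # made-up values
positions = [[0.0, 0.0, 0.0], [1.0e11, 0.0, 0.0], [0.0, 1.0e11, 0.0]]
print(total_forces(masses, positions))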

Types of parallelism, two extremes. Data parallel: each processor performs the same task on different data. Example - grid problems. Bag of Tasks or Embarrassingly Parallel is a special case. Task parallel: each processor performs a different task. Example - signal processing, such as encoding multitrack data. Pipeline is a special case. 11

Simple data parallel program. Example: integrate a 2-D propagation problem. [Figure: the starting partial differential equation, its finite difference approximation, and the x-y grid divided into strips assigned to PE #0 through PE #7] 12
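
As a generic illustration of the data-parallel idea, here is a minimal serial sketch in which each "PE" applies the same stencil to its own strip of the grid (a Jacobi-style averaging stencil is used as a stand-in for the propagation equation shown in the figure; the strip layout mirrors PE #0 through PE #7):

import numpy as np

ny, nx = 80, 80
grid = np.random.rand(ny, nx)      # made-up initial data
npes = 8                           # PE #0 .. PE #7, one strip of rows each
rows = ny // npes

new = grid.copy()
for pe in range(npes):             # in a real code each PE would do only its own strip
    lo = max(pe * rows, 1)
    hi = min((pe + 1) * rows, ny - 1)
    # identical stencil applied to different data: the essence of data parallelism
    new[lo:hi, 1:-1] = 0.25 * (grid[lo-1:hi-1, 1:-1] + grid[lo+1:hi+1, 1:-1] +
                               grid[lo:hi, :-2] + grid[lo:hi, 2:])
grid = new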

Typical Task Parallel Application: signal processing. The data flows through a Normalize task, an FFT task, a Multiply task, and an Inverse FFT task. Use one processor for each task; more processors can be used if one task is overloaded. This is a pipeline. 13
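
A minimal serial sketch of that pipeline using NumPy (the filter and the data are made up; in a task-parallel version each stage would run on its own processor, passing its output to the next stage):

import numpy as np

def normalize(x):            # stage 1: scale to unit peak
    return x / np.max(np.abs(x))

def forward_fft(x):          # stage 2: transform to the frequency domain
    return np.fft.fft(x)

def apply_filter(X, H):      # stage 3: multiply by a frequency response
    return X * H

def inverse_fft(X):          # stage 4: back to the time domain
    return np.fft.ifft(X).real

signal = np.random.randn(1024)           # made-up multitrack-like data
H = np.exp(-np.linspace(0, 5, 1024))     # made-up low-pass-like response

result = inverse_fft(apply_filter(forward_fft(normalize(signal)), H))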

Parallel Program Structure: [Diagram: the program begins, starts a parallel section in which processes 1 through N each do their own work (work 1a-1d, 2a-2d, ..., (N)a-(N)d), communicating and repeating as needed, then the parallel section ends and the program ends] 14

Parallel Problems: [Diagram: the same program structure, annotated with two problems - subtasks don't finish together at the end of a parallel section, and serial sections do no parallel work, so not all processors are being used] 15

A Real example (a simple two-process program that communicates through a file):

#!/usr/bin/env python
# Two copies of this script are run with different ids; process 1 writes a
# value to a file, and the other process waits for the file and then reads it.
from sys import argv
from os.path import isfile
from time import sleep
from math import sin, cos
#
fname = "message"
my_id = int(argv[1])
print my_id, "starting program"
#
if (my_id == 1):
    sleep(2)
    myval = cos(10.0)
    mf = open(fname, "w")
    mf.write(str(myval))
    mf.close()
else:
    myval = sin(10.0)
    notready = True
    while notready:
        if isfile(fname):
            notready = False
            sleep(3)
            mf = open(fname, "r")
            message = float(mf.readline())
            mf.close()
            total = myval**2 + message**2
        else:
            sleep(5)
print my_id, "done with program"

16
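
To try the example, one could run two copies of the script from the same directory (using a hypothetical file name prog.py), for example "python prog.py 0 &" followed by "python prog.py 1": process 1 writes cos(10.0) to the file "message", while process 0 polls for the file and then reads the value back.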

Theoretical upper limits. All parallel programs contain parallel sections and serial sections. Serial sections are when work is being duplicated or no useful work is being done (waiting for others). Serial sections limit the parallel effectiveness: if you have a lot of serial computation then you will not get good speedup, while no serial work allows perfect speedup. Amdahl's Law states this formally. 17

Amdahl's Law: Amdahl's Law places a strict limit on the speedup that can be realized by using multiple processors. Effect of multiple processors on run time: t_p = (f_p/N + f_s) t_s. Effect of multiple processors on speedup: S = t_s/t_p = 1/(f_p/N + f_s), where f_s = serial fraction of code, f_p = parallel fraction of code, and N = number of processors. Perfect speedup is t = t_1/N, or S(N) = N. 18
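
A small sketch that evaluates the formula above for a few processor counts (the 0.99 parallel fraction matches the plot a couple of slides later; everything else follows directly from the definition):

def amdahl_speedup(fp, n):
    """Speedup S = 1 / (fp/N + fs) with fs = 1 - fp."""
    fs = 1.0 - fp
    return 1.0 / (fp / n + fs)

for n in (2, 16, 64, 256):
    print("%4d processors -> speedup %.1f" % (n, amdahl_speedup(0.99, n)))
# Even with 99% of the work parallel, the speedup saturates near 1/fs = 100.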

Illustration of Amdahl's Law It takes only a small fraction of serial content in a code to degrade the parallel performance. 19

Amdahl's Law vs. Reality: Amdahl's Law provides a theoretical upper limit on parallel speedup, assuming that there are no costs for communications. In reality, communications will result in a further degradation of performance. [Plot: speedup vs. number of processors for fp = 0.99, comparing the Amdahl's Law curve with measured reality] 20

Sometimes you don't get what you expect! 21

Some other considerations: Writing an effective parallel application is difficult. Communication can limit parallel efficiency, serial time can dominate, and load balance is important. Is it worth your time to rewrite your application? Do the CPU requirements justify parallelization? Will the code be used just once? 22

Parallelism Carries a Price Tag: Parallel programming involves a steep learning curve and is effort-intensive. Parallel computing environments are unstable and unpredictable, don't respond to many serial debugging and tuning techniques, and may not yield the results you want even if you invest a lot of time. Will the investment of your time be worth it? 23

Terms related to algorithms: Amdahl's Law (talked about this already), Superlinear Speedup, Efficiency, Cost, Scalability, Problem Size, Gustafson's Law. 24

Superlinear Speedup: S(n) > n may be seen on occasion, but usually this is due to using a suboptimal sequential algorithm or some unique feature of the architecture that favors the parallel formulation. One common reason for superlinear speedup is the extra cache in the multiprocessor system, which can hold more of the problem data at any instant and so leads to less of the relatively slow memory traffic. 25

Efficiency: Efficiency = execution time using one processor over (the execution time using a number of processors times that number of processors). It's just the speedup divided by the number of processors: E(n) = S(n)/n = t_s/(n t_p). 26

Cost: The processor-time product or cost (or work) of a computation is defined as Cost = (execution time) x (total number of processors used). The cost of a sequential computation is simply its execution time, t_s. The cost of a parallel computation is t_p x n. The parallel execution time, t_p, is given by t_s/S(n). Hence, the cost of a parallel computation is given by Cost = (t_s x n)/S(n). Cost-Optimal Parallel Algorithm: one in which the cost to solve a problem on a multiprocessor is proportional to the cost (execution time) on a single processor. 27
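
A tiny sketch tying the last few definitions together, using made-up timings (the numbers are illustrative, not measurements from the slides):

ts = 100.0          # hypothetical sequential execution time (seconds)
tp = 30.0           # hypothetical parallel execution time on n processors
n = 4

speedup = ts / tp                 # S(n) = ts / tp
efficiency = speedup / n          # E(n) = S(n) / n
cost = tp * n                     # processor-time product

print("S = %.2f, E = %.2f, cost = %.0f (sequential cost = %.0f)"
      % (speedup, efficiency, cost, ts))
# Cost-optimal means the cost stays proportional to ts as n grows.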

Scalability: used to indicate a hardware design that allows the system to be increased in size and, in doing so, to obtain increased performance - this could be described as architectural or hardware scalability. Scalability is also used to indicate that a parallel algorithm can accommodate increased data items with a low and bounded increase in computational steps - this could be described as algorithmic scalability. 28

Problem size: the number of basic steps in the best sequential algorithm for a given problem and data set size. Intuitively, we would think of the number of data elements being processed in the algorithm as a measure of size. However, doubling the data set size would not necessarily double the number of computational steps; it will depend upon the problem. For example, adding two matrices has this effect, but multiplying matrices quadruples operations. Note: bad sequential algorithms tend to scale well. 29

Other names for Scaling: Strong Scaling (Engineering) - for a fixed total problem size, how does the time to solution vary with the number of processors? Weak Scaling - how does the time to solution vary with processor count, with a fixed problem size per processor? 30

Some Classes of Machines - Distributed Memory: [Diagram: several processor-memory pairs connected by a network] Processors only have access to their local memory and talk to other processors over a network. 31

Some Classes of Machines - Uniform Shared Memory (UMA): [Diagram: many processors attached to a single memory] All processors have equal access to the memory and can talk via memory. 32

Some Classes of Machines - Hybrid: shared memory nodes connected by a network. 33

Some Classes of Machines - More common today: each node has a collection of multicore chips. For example, Ra has 268 nodes: 256 quad core dual socket and 12 dual core quad socket. 34

Some Classes of Machines - Hybrid Machines: add special purpose processors (FPGA, GPU, vector, Cell...) to normal CPUs. Not a new concept, but regaining traction. Example: our Nvidia Tesla node, cuda1. 35

Network Topology: For ultimate performance you may be concerned with how your nodes are connected. Avoid communications between distant nodes. For some machines it might be difficult to control or know the placement of applications. 36

Network Terminology: Latency - how long it takes to get between nodes in the network. Bandwidth - how much data can be moved per unit time. Bandwidth is limited by the number of wires, the rate at which each wire can accept data, and choke points. 37

Ring 38

Grid: wrapping the edges around produces a torus. 39

Tree: in a fat tree, the links get wider as you go up. 40

Hypercube: [Diagram: a 3-dimensional hypercube with nodes labeled 000 through 111] 41

4D Hypercube: [Diagram: a 4-dimensional hypercube with nodes labeled 0000 through 1111] Some communications algorithms are hypercube based. How big would a 7d hypercube be? 42
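
A short sketch of the structure behind that question (standard hypercube facts, not taken from the slides: a d-dimensional hypercube has 2^d nodes, and each node has d neighbors reached by flipping one bit of its label, so a 7d hypercube would have 128 nodes):

def hypercube_neighbors(node, d):
    """Neighbors of a node in a d-dimensional hypercube: flip one bit at a time."""
    return [node ^ (1 << bit) for bit in range(d)]

d = 7
print("a %dd hypercube has %d nodes, each with %d links" % (d, 2 ** d, d))
print("neighbors of node 0: %s" % hypercube_neighbors(0, d))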

Star? Quality depends on what is in the center 43

Example: An InfiniBand Switch. InfiniBand, DDR, Cisco 7024 IB Server Switch with 48 port adaptors. Each compute node has one DDR 1-port HCA. 4X DDR => 16 Gbit/sec, 140 nanosecond hardware latency, 1.26 microsecond latency at the software level. 44
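
A simple latency-plus-bandwidth model of message transfer time using the numbers quoted above (a rough sketch; it ignores protocol overhead, contention, and the gap between peak and achievable bandwidth):

latency = 1.26e-6                 # software-level latency, seconds
bandwidth = 16e9 / 8.0            # 16 Gbit/sec expressed in bytes/sec

def message_time(nbytes):
    """Estimated time to send a message: startup latency + transfer time."""
    return latency + nbytes / bandwidth

for nbytes in (8, 1024, 1024 * 1024):
    print("%8d bytes -> %8.2f microseconds" % (nbytes, message_time(nbytes) * 1e6))
# Small messages are dominated by latency, large ones by bandwidth.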

Measured Bandwidth 45