Designing and Building Applications for Extreme Scale Systems CS598 William Gropp www.cs.illinois.edu/~wgropp




Welcome! Who am I? William (Bill) Gropp, Professor of Computer Science. One of the creators of the MPI parallel programming system (we'll learn more about this in this course), creator of the MPICH implementation of MPI, and creator of the PETSc library for the scalable solution of PDEs on massively parallel processors. We'll talk about the ideas behind this library and similar scalable tools. 2

What is this Course About? Making effective use of the most powerful parallel computer systems for problems in science and engineering. Doing this requires paying attention to every part of the parallel system; it also requires a scientific and rigorous approach to performance. 3

Course Goals: Understand the sources of performance in modern architectures. How to establish performance expectations (not precise performance predictions). How to diagnose performance problems. How to design algorithms and applications to maximize performance potential. How to exploit parallelism at all levels. 4
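To make "performance expectations, not precise predictions" concrete, here is a minimal sketch (with assumed numbers, not from the course notes): for memory-bandwidth-bound kernels such as sparse matrix-vector multiply, a simple lower bound on run time is bytes moved divided by sustainable memory bandwidth. The CSR byte counts and the 10 GB/s figure are illustrative assumptions.

```python
def expected_time_spmv(nrows, nnz, bandwidth_bytes_per_s):
    """Bandwidth-bound time estimate for y = A*x with A stored in CSR.

    Assumed traffic: per nonzero, an 8-byte value and a 4-byte column
    index; per row, reads/writes of x and y (8 bytes each) plus a
    4-byte row pointer. Assumes x is not rereferenced from memory.
    """
    bytes_moved = nnz * (8 + 4) + nrows * (8 + 8 + 4)
    return bytes_moved / bandwidth_bytes_per_s

# Illustrative: a 1M-row matrix with ~5 nonzeros per row on a
# hypothetical 10 GB/s memory system.
t = expected_time_spmv(1_000_000, 5_000_000, 10e9)
print(f"expected SpMV time: {t * 1e3:.2f} ms")
```

An expectation like this is useful even when it is off by a factor of two: if measured time is ten times the bound, something is wrong with the implementation, not the machine.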

Course Organization: Lectures. Lectures will primarily be prerecorded, available at least a few days before class. You will often be expected to view these before the scheduled class; we may review them during class time. Lectures may include questions for you to check your comprehension of the material; make sure you can apply the ideas before going on. Lectures are typically 10-40 minutes, so you will often be assigned two or more lectures before each class. 5

Course Organization: In-Class Time. Class time will primarily be used for discussion, questions, and in-class assignments. There are weekly assignments, including machine problems. Class is Wednesdays and Fridays, 11-12:15, Siebel 1304. 6

Other Logistics. Course online presence: Moodle (start at https://bw-course.ncsa.illinois.edu); lectures (video and slides), assignments, discussion, etc. are here. TA: Xiao Chen, xchen116@illinois.edu; office hours TBA. 7

Why This Organization? There is evidence that conventional lectures are not the best way to instruct, particularly for a graduate class where backgrounds are varied. Two other institutions are sharing this class: University of North Dakota and University of Wyoming. The video lectures permit you to study the material on your own schedule, including following up and pausing to check information. You should take advantage of the opportunity to check out suggested readings and try your own experiments. 8

Evaluation: Homework every week, including machine problems as well as paper problems. Project: students will form small (2-3 person) teams to explore scalability and performance for a problem of interest. Project proposals are due in March; the final report is due at the end of term. 9

Sample Projects: Model the performance of a halo exchange and improve a simple implementation. Modify a benchmark to take chip and node topology into account and compare to modeled performance. Take an application that you are working on, analyze the I/O performance, and study one approach to improve that performance. 10
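For the first sample project, a hypothetical starting point (not the course's own model) is the standard latency-bandwidth ("alpha-beta") estimate of message time: T = alpha + beta * message_bytes. The alpha and beta values below are assumed, not measured.

```python
def halo_exchange_time(n_local, ghost_width, alpha, beta, bytes_per_val=8):
    """Estimated time for one 2D halo exchange on a square local grid.

    Simplifying assumptions: four neighbor messages (no diagonals),
    each an edge strip of n_local * ghost_width values, sent serially.
    alpha = per-message latency (s), beta = seconds per byte.
    """
    msg_bytes = n_local * ghost_width * bytes_per_val
    return 4 * (alpha + beta * msg_bytes)

# Assumed network: 2 microsecond latency, 10 GB/s bandwidth,
# 1000x1000 local grid with one-cell-wide ghost region.
t = halo_exchange_time(n_local=1000, ghost_width=1, alpha=2e-6, beta=1 / 10e9)
print(f"estimated halo exchange time: {t * 1e6:.1f} us")
```

A project would then compare this estimate to measured times and refine the model (overlapping messages, diagonal neighbors, contention) where it disagrees.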

More on Projects: Feel free to propose something you are already working on as the project; the best projects are the ones in which you have the most invested. The purpose of the project proposal is to ensure that the project is neither too large nor too small. 11

My Schedule: As my students know, I have a busy travel schedule. I will lead as many of the class sessions as possible, shortening my trips where possible. I am happy to schedule additional time to meet with you individually or in groups; send me email. For our remote sites, we can arrange video sessions. 12

Computer Resources: Computer time will be made available to Illinois students on the campus cluster (Taub) and, for all students, on Blue Waters. Taub is a typically sized institutional cluster: not an extreme scale system, but sufficient for many experiments. Some features of Taub make it a good platform to illustrate challenges. 13

Questions: Write down answers to these questions and turn them in. Do you know MPI (the Message Passing Interface)? Do you consider yourself a beginner, intermediate, or expert? Do you know OpenMP? Do you consider yourself a beginner, intermediate, or expert? Which programming languages do you use? C? C++? Fortran? Python? Others? 14

Why Do We Need Extreme Scale Systems? Many problems cannot be solved exactly; even the three-body problem can only be solved for a few very special cases. Apparently symmetric situations may not be: e.g., there is evidence that supernovae depend on fully 3D solutions that are not radially (or, with a rotating star, axially) symmetric. 15

Extreme Scale Computing Applies To A Broad Range Of Science And Engineering Disciplines: Molecular Science, Weather & Climate Forecasting, Astrophysics, Astronomy, Earth Science, Health, Life Science, Materials. 16

What are the Limits of Extreme Scale Systems? PDE simulations: 10^10 nodes common (about 2K cubed), often on unstructured meshes. Molecular dynamics and n-body problems: 10^8-10^10 particles common. Discrete event simulations: 10^9 events and greater possible. 17

Tentative Course Outline 1. Introduction 2. Extreme scale systems, Simple performance models 3. Benchmarks; performance models with sparse matrix-vector multiply 4. Cache memory; transpose 5. Memory hierarchy; blocking; Cache oblivious models 6. Processor execution 18

Tentative Course Outline 7. Vectors; Amdahl's law and n_1/2 8. Moore's law and Dennard scaling; Multicore and Manycore 9. Threads and programming models 10. OpenMP basics; Loop parallelism 11. Task parallelism; Locks 12. Memory consistency; performance hazards 13. Distributed memory architecture 19

Tentative Course Outline 14. MPI Basics and performance models 15. Strategies for parallelism 16. MPI nonblocking and asynchronous communication; Progress 17. LogGP performance models 18. MPI Topology and interconnects 20

Tentative Course Outline 19. MPI Collectives and performance models 20. I/O and Parallel I/O 21. MPI I/O Basics 22. Different I/O organizations 23. MPI RMA and performance models 24. RMA put/get/accumulate operations; Atomic RMW 21

Tentative Course Outline 25. More on scaling; isoefficiency 26. Checkpointing basics 27. In-memory checkpointing; fault models; performance models 22

Common Theme for Course: Use models of computation that are as simple as possible (an execution model) to gain insight into performance issues. Use them both to diagnose issues and to design for performance. One quick example, which we'll revisit: comparing scalable algorithms. 23

Comparing Scalability: Two algorithms run in parallel; perfect speedup equals the number of processors. Algorithm 1 is (nearly) perfect; Algorithm 2 is fading. Which is best? 24

Scalability and Time: Algorithm 1 has very poor uniprocessor performance; Algorithm 2 has very good uniprocessor performance. This is not a (too) contrived case; such examples have appeared in papers and proposals. 25
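The two slides above can be made concrete with a small sketch (all timings below are assumed for illustration): Algorithm 1 scales almost perfectly but starts from a slow serial code, while Algorithm 2 scales poorly, Amdahl-style, but starts from a fast serial code.

```python
def time_alg1(p):
    """Algorithm 1: slow serial code, (nearly) perfect speedup."""
    t1 = 100.0           # assumed serial time, seconds
    return t1 / p

def time_alg2(p):
    """Algorithm 2: fast serial code, fading (Amdahl-style) speedup."""
    t1 = 10.0            # assumed serial time, seconds
    serial_frac = 0.05   # assumed unparallelizable fraction
    return t1 * (serial_frac + (1 - serial_frac) / p)

for p in (1, 16, 64):
    print(f"p={p:3d}  alg1: {time_alg1(p):7.3f} s  "
          f"speedup {100.0/time_alg1(p):5.1f}x   "
          f"alg2: {time_alg2(p):7.3f} s  speedup {10.0/time_alg2(p):5.1f}x")
```

At p = 64, Algorithm 1 shows 64x speedup but takes about 1.56 s, while Algorithm 2 shows only about 15x speedup yet finishes in about 0.65 s: the "worse-scaling" algorithm wins on time to solution.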

Questions Why are you taking the class? Do you have a specific application in mind? What is the largest system you've run on? What is the longest job you've run (in terms of elapsed time from start to finish)? 26

How Do We Know if there is a Performance Problem? "My application scales well!" So what? Is it efficient? (Note that making the scalar code more efficient decreases scalability.) How can we know? To what do we compare? In this class, we will develop techniques to answer these questions and to guide the development of high performance applications. 27
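One answer to "to what do we compare?" can be sketched as follows (all timings assumed for illustration): measure efficiency against the best known serial implementation, not against the parallel code running on one process.

```python
def efficiency(t_best_serial, t_parallel, p):
    """Parallel efficiency relative to the best known serial code."""
    return t_best_serial / (p * t_parallel)

# Assumed timings for one hypothetical application:
t_best_serial = 10.0    # best serial implementation
t_par_on_1 = 100.0      # parallel code on 1 process (inefficient scalar code)
t_par_on_64 = 1.6       # parallel code on 64 processes

# Self-relative speedup looks impressive ...
print(f"self-relative speedup: {t_par_on_1 / t_par_on_64:.1f}x")
# ... but relative to the best serial code, efficiency is poor:
print(f"true efficiency: {efficiency(t_best_serial, t_par_on_64, 64):.3f}")
```

Here the self-relative speedup is 62.5x, yet true efficiency is under 10%: the application "scales well" only because its scalar code is slow, which is exactly the trap the slide warns about.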