CS 575 Parallel Processing





CS 575 Parallel Processing
Lecture 1: Introduction
Wim Bohm, Colorado State University

Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 license.

Course Topics
- Introduction, background
- Orders of magnitude, recurrences
- Models of parallel computing, communication
- Performance: speedup, efficiency
- Parallel algorithms:
  - Dense linear algebra
  - Sorting
  - Graphs
  - Search
  - Fast Fourier Transform

Course Organization
- Course reorganization: unite 575 and 575dl
- Modernize: more parallel algorithms, GPUs
- We have separate course streams in networking and distributed systems
- Check the web page regularly; course organization is described at www.cs.colostate.edu/~cs575dl (let's go look...)
- The project changes regularly to stay fresh; the second half of the course covers GPUs/CUDA

Cost-effective Parallel Computing
- Off-the-shelf commodity processors are very fast
- Memory is very cheap
- Building a processor that is a small factor faster costs an order of magnitude more
- Clusters are the cheapest way to get more performance: multiprocessors, NoWs (networks of workstations)
- Datacenters employ O(100K) simple processors with cheap interconnects
- A workstation can itself be an SMP: shared memory, bus or crossbar (e.g. Cray)

Wile E. Coyote's Parallel Computer
- Get a lot of the fastest processors
- Get a lot of memory per processor
- Get the fastest network
- Hook it all together
- And then what???

Now you gotta program it!
Parallel programming introduces:
- Task partitioning, task scheduling
- Data partitioning, distribution
- Synchronization
- Load balancing
- Latency issues: hiding, tolerance

Problem with the Wile E. Coyote Architecture
- For high speed, processors have lots of state: cache, stack, global memory
- To tolerate latency, we need fast context switch. WHY?
- No free lunch: can't have both, certainly not if the processor was not designed for both
- Memory wall: memory gets slower and slower, in terms of the number of cycles it takes to access. WHY? HOW?
- The memory hierarchy gets more complex

Sequential Algorithms
Efficient sequential algorithms:
- Minimize time and space
- Maximize state (avoiding re-computation)
- Efficiency is portable: an efficient program on a Pentium is ~ an efficient program on an Opteron

Parallel Algorithms
Efficient parallel algorithms:
- Use efficient sequential algorithms
- Maximize parallelism: re-computation is sometimes better than communication
- Minimize overhead: synchronization, remote accesses
- Parallel efficiency is architecture dependent

Speedup
- Ideal: n processors -> n-fold speedup
- The ideal is not always possible. WHY?
  - Tasks are data dependent
  - Not all processors are always busy
  - Remote data needs communication
  - Memory wall PLUS communication wall
- Linear speedup: α·n speedup (α ≤ 1)
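The definitions above can be made concrete in a few lines of Python. This is a sketch of my own (the timing numbers are invented for illustration), not an example from the lecture:

```python
# Speedup and efficiency for a parallel run.
# Linear speedup means efficiency alpha <= 1; ideal speedup means alpha = 1.

def speedup(t_serial, t_parallel):
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, n_procs):
    # efficiency = speedup / n
    return speedup(t_serial, t_parallel) / n_procs

t1 = 100.0   # serial time in seconds (assumed)
t16 = 8.0    # time on 16 processors (assumed)
print(speedup(t1, t16))         # 12.5-fold speedup
print(efficiency(t1, t16, 16))  # alpha ~ 0.78: linear, but not ideal
```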

Super-linear Speedup
- Super-linear speedup: α > 1. Discuss... is it possible?
- Nonsense! Because we can execute the faster parallel program sequentially
- No nonsense!! Because parallel computers do not just have more processors, they also have more local memory / caches

Parallel Programming Paradigms
Implicit parallel programming: super compilers
- The compiler extracts parallelism from sequential code, distributes data, and creates and schedules tasks
- Complication: side effects
  - the sequential order of reads and writes to a memory location determines the program outcome
  - a parallelizing compiler must obey the sequential order of side-effecting statements and still create parallelism
  - pointers, aliases, and indirect array references make analyzing which statements access which locations hard or impossible
- 40 years of compiler research for general-purpose parallel computing has not brought much result

Paradigms cont.
Implicit parallel programming cont.
- Simple, clean case: Functional Programming (FP)
- Functions have no side effects, so the order of execution is less constrained: in F( P(x,y), Q(y,z) ), P and Q can be executed in parallel
- Simple single-assignment memory model: no pointers, no write-after-read or write-after-write hazards (dataflow semantics)
- FP was long doomed: too high level, too inefficient, because the simple memory model causes lots of copies
- FP is coming back: the MapReduce approach in data centers (Google) is a data-parallel functional paradigm
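The data-parallel functional style can be sketched in a few lines of Python. This is my own illustration (not from the lecture): a word count written as a side-effect-free map step followed by an associative reduce step, the shape MapReduce exploits:

```python
from functools import reduce
from collections import Counter

# Map: each document independently produces partial word counts.
# No side effects, so the map calls could all run in parallel.
def map_counts(doc):
    return Counter(doc.split())

# Reduce: combine partial results with an associative merge,
# so the combining can also be done in parallel, tree-fashion.
def reduce_counts(a, b):
    return a + b

docs = ["the quick fox", "the lazy dog", "the fox"]
total = reduce(reduce_counts, map(map_counts, docs))
print(total["the"])  # 3
print(total["fox"])  # 2
```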

Explicit Parallel Programming
- Multithreading: OpenMP
- Message passing: MPI
- Data-parallel programming (important niche): CUDA
- Explicit parallelism complicates programming:
  - creation, allocation, and scheduling of processes
  - data partitioning
  - synchronization (semaphores, locks, messages)
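As a small illustration of what "explicit" means here (using Python threads rather than OpenMP or MPI, so only a sketch of the idea): the programmer, not the compiler, must partition the data, create the workers, and decide where results are combined:

```python
from concurrent.futures import ThreadPoolExecutor

# Explicit data partitioning: split the input among workers by hand.
def partial_sum(chunk):
    return sum(chunk)

data = list(range(1000))
n_workers = 4
size = len(data) // n_workers
chunks = [data[i * size:(i + 1) * size] for i in range(n_workers)]

# Explicit task creation and scheduling; the join at the end of the
# `with` block is the synchronization point.
with ThreadPoolExecutor(max_workers=n_workers) as pool:
    partials = list(pool.map(partial_sum, chunks))

print(sum(partials))  # 499500 == sum(range(1000))
```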

Example 1: Weather Prediction
- Area: 3000 × 3000 × 11 cubic miles, in 0.1 × 0.1 × 0.1 cubic-mile segments: ~10^11 segments
- Two-day prediction, half-hour time steps: ~100 time steps
- Computation per segment: temperature, pressure, humidity, wind speed, wind direction, for each time step in each segment
- Assume ~100 FLOPs per time step per segment

Performance: Weather Prediction
- Computational requirement: 10^15 FLOPs
- Assume one FLOP per clock cycle; 1 core at 4 GHz
- Total serial time: 25×10^4 sec ≈ 70 hours
- Not too good for a 48-hour weather prediction

Parallel Weather Prediction
- 1K workstations, grid connected: 10^8 segment computations per processor
- 10^8 instructions per second, 100 instructions per segment computation
- 100 time steps: 10^4 seconds ≈ 3 hours — much more acceptable
- Assumption: communication is not a problem here. Why is this assumption reasonable?
- More workstations: finer grid, better accuracy
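The arithmetic in the two slides above can be checked directly. A sketch; all the numbers are the slides' own assumptions:

```python
# Serial case: 10^11 segments * 100 time steps * 100 FLOPs = 10^15 FLOPs.
segments = 1e11
steps = 100
flops_per_seg_step = 100
total_flops = segments * steps * flops_per_seg_step  # 1e15

serial_rate = 4e9  # 4 GHz, one FLOP per cycle
serial_hours = total_flops / serial_rate / 3600
print(round(serial_hours))  # ~69 hours: too slow for a 48-hour forecast

# Parallel case: 1000 workstations at 10^8 instructions/sec each.
procs = 1000
par_rate = 1e8
parallel_hours = total_flops / (procs * par_rate) / 3600
print(round(parallel_hours, 1))  # ~2.8 hours
```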

Example 2: N-body Problem
- Astronomy: bodies in space attract each other (gravitational force, Newton's law)
- O(n^2) calculations per snapshot
- Galaxy: ~10^11 bodies -> ~10^22 calculations per snapshot
- At 1 microsecond per calculation, a snapshot takes 10^16 secs ≈ 10^11 days ≈ 3×10^8 years
- Is parallelism going to help us? NO
- What does help? A better algorithm: Barnes-Hut
  - Divides the space into a quad tree (or oct tree)
  - Treats far-away quads as one body: O(n log n)
- How much time per snapshot now?
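A rough back-of-the-envelope estimate for the closing question. This is only a sketch: it ignores the constant factors hidden in Barnes-Hut's O(n log n) and reuses the slide's 1 µs per calculation:

```python
import math

n = 1e11       # bodies (slide's assumption)
t_calc = 1e-6  # seconds per pairwise calculation (slide's assumption)

brute = n * n * t_calc                  # O(n^2): 10^16 seconds
barnes_hut = n * math.log2(n) * t_calc  # O(n log n), constants ignored

print(brute / (86400 * 365.25))  # ~3e8 years, matching the slide
print(barnes_hut / 86400)        # tens of days: still big, but now
                                 # parallelism can bring it down further
```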

Other Challenging Applications
- Satellite data acquisition: billions of bits per second; pollution levels, remote sensing of materials
- Image recognition
- Discrete optimization problems: planning, scheduling, VLSI design
- Bio-informatics, computational chemistry
- Airplane / satellite / vehicle design
- Internet (Google search)

Application-Specific Architectures
- ASICs: Application-Specific Integrated Circuits
- Levels of specificity: full-custom ASICs, standard-cell ASICs, field-programmable gate arrays
- Computational models: dataflow graphs, systolic arrays
- Promise orders of magnitude better performance and lower power

ASICs cont.
How much faster than general purpose? Example: 1D 1024-point FFT
- General-purpose machine (G4): 25 microseconds
- ASIC device (MIT Lincoln Labs): 32 nanoseconds
- The ASIC device uses 20 milliwatts (100× less power)
Other applications:
- Finite Impulse Response (FIR) filters
- Matrix multiply
- QR decomposition
What do these all have in common?

Background
If you do not have the necessary background in analysis of algorithms:
- See the book Introduction to Algorithms by Cormen, Leiserson, Rivest and Stein, or go online
- Topics to study: introduction, growth of functions, summations, recurrences

O, Ω, Θ — Background: Orders of Magnitude
- f(n) = O(g(n)) iff there exist c, n0 such that f(n) ≤ c·g(n) for all n > n0.
  Used for the upper bound of algorithm complexity: this particular algorithm takes at most c·g(n) time.
- f(n) = Ω(g(n)) iff there exist c, n0 such that f(n) ≥ c·g(n) for all n > n0.
  Used for the lower bound of problem complexity: any algorithm solving this problem takes at least c·g(n) time.
- f(n) = Θ(g(n)) iff f(n) = O(g(n)) and f(n) = Ω(g(n)): a tight bound.
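The definitions can be checked numerically. A small illustration of my own (not from the slides): verifying that f(n) = 3n + 5 is O(n) with the witnesses c = 4 and n0 = 5:

```python
# f(n) = 3n + 5 is O(n): choose c = 4, n0 = 5;
# then f(n) <= c*n holds for every n > n0 (checked over a finite range).
def f(n):
    return 3 * n + 5

c, n0 = 4, 5
assert all(f(n) <= c * n for n in range(n0 + 1, 10_000))
print("witnesses c=4, n0=5 hold")
```

Exhibiting one (c, n0) pair proves the upper bound; a proof for all n > n0 takes one line of algebra: 3n + 5 ≤ 4n whenever n ≥ 5.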

Background: Closed Problems
- A problem P is closed if there is an algorithm X with O(X) = Ω(P), i.e. the best algorithm matches the problem's lower bound.
  e.g. sorting has a tight bound: Θ(n log n)
- Problem P has an algorithmic gap if P is not closed.
  e.g. all NP-complete problems (problems with a polynomial lower bound but currently an exponential upper bound, such as TSP)

Recurrence Relations
- Algorithmic complexity is often described using recurrence relations: f(n) = R( f(1) .. f(n-1) )
- Two important types of recurrence relations: linear, and divide and conquer
- cs420(dl) covers these

Repeated Substitution
- Simple recurrence relations (one recurrent term in the rhs) can sometimes be solved using repeated substitution
- Two types: linear and divide-and-conquer (DivCo)
  - Linear: F(n) = aF(n-d) + g(n), base: F(1) = v1
  - DivCo:  F(n) = aF(n/d) + g(n), base: F(1) = v1
- Two questions: what is the pattern, and how often is it applied until we hit the base case?

Linear Example
M(n) = 2M(n-1) + 1, M(1) = 1 — recognize this recurrence?
  M(n) = 2M(n-1) + 1
       = 2(2M(n-2) + 1) + 1 = 4M(n-2) + 2 + 1
       = 4(2M(n-3) + 1) + 2 + 1 = 8M(n-3) + 4 + 2 + 1
       = ...                                            (inductive step)
       = 2^k M(n-k) + 2^(k-1) + 2^(k-2) + ... + 2 + 1
  hit the base for k = n-1:
       = 2^(n-1) M(1) + 2^(n-2) + ... + 2 + 1 = 2^n - 1
For more on linear recurrence relations, see 420dl.
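The closed form can be confirmed numerically, a quick sanity check of my own on the derivation above:

```python
# M(n) = 2*M(n-1) + 1, M(1) = 1, computed iteratively,
# checked against the closed form 2^n - 1.
def M(n):
    m = 1
    for _ in range(n - 1):
        m = 2 * m + 1
    return m

assert all(M(n) == 2**n - 1 for n in range(1, 30))
print(M(10))  # 1023
```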

DivCo Example
Merge sort: T(n) = 2T(n/2) + n, T(1) = 1, n = 2^k
  T(n) = 2T(n/2) + n
       = 2(2T(n/4) + n/2) + n = 4T(n/4) + 2n
       = 8T(n/8) + 3n
       = ...                         (inductive step)
       = 2^k T(n/2^k) + kn
  hit the base for k = log n:
       = n·T(1) + n·log n = n + n·log n = O(n log n)

Another One: Binary Search
f(n) = f(n/2) + c, f(1) = 1; let n = 2^k
  f(n) = f(n/2) + c
       = f(n/4) + 2c
       = f(n/8) + 3c
       = f(n/2^k) + kc
  hit the base for k = log n:
       = f(1) + c·log n = O(log n)

Master Method
Cookbook approach to the solution, based on repeated substitution (Cormen et al. or Rosen):
  A(n) = C·A(n/d) + k·n^p
- A(n) = O(n^p)          if C < d^p    e.g. A(n) = 3A(n/2) + n^2
- A(n) = O(n^p log n)    if C = d^p    e.g. A(n) = 2A(n/2) + n
- A(n) = O(n^(log_d C))  if C > d^p    e.g. A(n) = 3A(n/2) + n
Do binary search and merge sort with this method.
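The three cases fit in a small helper function. This is my own sketch of the cookbook (the function name and string output are mine, not the slides'):

```python
import math

# Classify A(n) = C*A(n/d) + k*n^p by comparing C with d^p.
def master(C, d, p):
    if C < d**p:
        return f"O(n^{p})"
    if C == d**p:
        return f"O(n^{p} log n)"
    return f"O(n^{math.log(C, d):.3f})"  # n^(log_d C)

print(master(3, 2, 2))  # 3 < 2^2 -> O(n^2)
print(master(2, 2, 1))  # 2 = 2^1 -> O(n^1 log n): merge sort
print(master(1, 2, 0))  # 1 = 2^0 -> O(n^0 log n) = O(log n): binary search
print(master(3, 2, 1))  # 3 > 2^1 -> O(n^1.585)
```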

Examples
- Merge sort: T(n) = 2T(n/2) + n, T(1) = 1. C = ? d = ? p = ? d^p = ? T(n) = O(???)
- Binary search: f(n) = f(n/2) + c, f(1) = 1. C = ? d = ? p = ? d^p = ? f(n) = O(???)