A Comparison of Shared Memory Parallel Programming Models
Jace A. Mogill and David Haglin

Parallel Programming Gap

Not many innovations:
- Memory semantics have been unchanged for over 50 years
- The 2010 multi-core x86 programming model is identical to the 1982 symmetric multiprocessor programming model

Unwillingness to adopt new languages:
- Users want to leverage existing investments in code
- They prefer to migrate incrementally to parallelism

Where the models stand today:
- OpenMP/microtasking: data parallelism
- Pthreads: task parallelism
- Wishful thinking: mixed task and data parallelism

Expressing Synchronization and Parallelism

Parallelism and synchronization are orthogonal:

                           Explicit parallelism    Implicit parallelism
    Code synchronization   Pthreads / OpenMP       Vectorization
    Data synchronization   MTA                     Dataflow

Synchronization and parallelism primitives can be mixed:
- OpenMP mixed with atomic memory operations (AMOs)
- MTA synchronization mixed with OpenMP parallel loops
- OpenMP mixed with Pthread mutex locks
- OpenMP or Pthreads mixed with vectorization
- All of the above mixed with MPI, UPC, CAF, etc.
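
To make the mixing concrete, here is a minimal editorial sketch (not from the slides) of an OpenMP parallel loop whose synchronization comes from a compiler-provided atomic memory operation rather than from an OpenMP construct; the variable names are illustrative. It should build with, for example, gcc -fopenmp.

    #include <stdio.h>

    /* Illustrative only: OpenMP supplies the parallelism, while the
     * synchronization is a GCC atomic builtin (an "AMO") instead of an
     * OpenMP critical or atomic construct. */
    int main(void) {
        long total = 0;
        long data[1000];
        for (int i = 0; i < 1000; i++) data[i] = i;

        #pragma omp parallel for              /* explicit, thread-centric parallelism */
        for (int i = 0; i < 1000; i++) {
            __sync_fetch_and_add(&total, data[i]);  /* AMO used for synchronization */
        }

        printf("total = %ld\n", total);       /* expect 499500 */
        return 0;
    }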

Thread-Centric versus Data-Centric

Task: map millions of degrees of parallelism onto tens of (thousands of) processors.

Thread-centric: manage threads
- Optimizes for a specific machine organization
- Requires careful scheduling of data movement to and from each thread
- Difficult to load balance dynamic and nested parallel regions

Data-centric: manage data dependencies
- The compiler already does this for ILP, loop optimization, and vectorization
- Optimizes for concurrency, which is performance portable
- Moving the task to the data is a natural option for load balancing

Lock-Free and Wait-Free Algorithms

Lock-free and wait-free algorithms... don't really exist:
- Only embarrassingly parallel algorithms use no synchronization

Compare-and-swap...
- is not lock-free or wait-free
- has no concurrency
- is a synchronization primitive that corresponds to mutex try-lock in the Pthreads programming model

Compare And Swap

LOCK# CMPXCHG: the x86 locked compare-and-exchange instruction.

Programming idioms:
- Similar to mutex try-lock
- Mutex locks can spin on try-lock or yield to the OS/runtime

So-called lock-free algorithms:
- Manually coded secondary lock handler
- Manually coded tertiary lock handler...
- All this try-lock handler work is not algorithmically efficient... it's lock-free turtles all the way down

Implementation:
- Instruction in all i386 and later processors
- Efficient for processors sharing caches and memory controllers
- Not efficient or fair for non-uniform machine organizations

Atomic memory operations do not scale: it is not possible to go 10,000-way parallel on one piece of data.
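
As a hedged illustration of the "CAS is really a try-lock" point, this C11 sketch (editorial, not the authors' code) wraps compare-and-swap in a retry loop: try_add plays the role of pthread_mutex_trylock, and the spin loop is the hand-coded secondary lock handler the slide describes.

    #include <stdatomic.h>
    #include <stdbool.h>

    static _Atomic long counter = 0;

    /* The CAS either succeeds (the "try-lock" is acquired and the update
     * is published) or fails and the caller must retry. */
    bool try_add(long delta) {
        long old = atomic_load(&counter);
        long desired = old + delta;
        /* Succeeds only if nobody changed counter since we read it --
         * semantically like pthread_mutex_trylock() succeeding. */
        return atomic_compare_exchange_strong(&counter, &old, desired);
    }

    void add(long delta) {
        while (!try_add(delta)) {
            /* secondary "lock handler": spin and try again */
        }
    }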

Thread-Centric Parallel Regions

OpenMP:
- Implied fork/join scaffolding
- Parallel regions are separate from loops
- Unannotated loops: every thread executes all iterations
- Annotated loops: iterations are decomposed among the existing threads
- Joining threads: exit the parallel region, barriers

Pthreads:
- Fully explicit scaffolding
- Forking threads: one new thread per pthread_create()
- Loops or trees are required to start multiple threads
- Flow control: pthread_barrier_wait(), mutex locks
- Joining threads: pthread_join(), return()
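
A small sketch, assuming a toy loop of N iterations, of the two kinds of scaffolding the slide contrasts; the function names are illustrative.

    #include <omp.h>
    #include <pthread.h>
    #include <stdio.h>

    #define N 8

    /* OpenMP: implied fork/join.  The annotated loop is decomposed among
     * the threads of the enclosing parallel region. */
    static void openmp_version(void) {
        #pragma omp parallel            /* fork */
        {
            #pragma omp for             /* iterations split across threads */
            for (int i = 0; i < N; i++)
                printf("OpenMP iteration %d\n", i);
        }                               /* implicit barrier, then join */
    }

    /* Pthreads: fully explicit scaffolding -- one loop just to start the
     * threads, another to join them. */
    static void *worker(void *arg) {
        printf("pthread worker %ld\n", (long)arg);
        return NULL;
    }

    static void pthreads_version(void) {
        pthread_t tid[N];
        for (long i = 0; i < N; i++)
            pthread_create(&tid[i], NULL, worker, (void *)i);  /* one thread per call */
        for (int i = 0; i < N; i++)
            pthread_join(tid[i], NULL);                        /* explicit join */
    }

    int main(void) {
        openmp_version();
        pthreads_version();
        return 0;
    }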

Data-Centric Parallel Regions on XMT

Multiple loops with different trip counts, after restructuring a reduction, all in a single parallel region:

    void foo(int n, double* restrict a, double* restrict b,
             double* restrict c, double* restrict d) {
        int i, j;
        double sum = 0.0;
        for (i = 0; i < n; i++)        /* 3 P:$  ** reduction moved out of 1 loop */
            sum += a[i];
        for (j = 0; j < n/2; j++)      /* 5 P */
            b[j] = c[j] + d[j] * sum;
    }

Compiler report:
- Parallel region 1 in foo: multiple processor implementation, requesting at least 50 streams
- Loop 2 in foo, in region 1, in parallel phase 1: dynamically scheduled, variable chunks, min size = 26, compiler generated
- Loop 3 in foo at line 7, in loop 2: 1 load, 0 stores, 1 floating-point operation, 1 instruction; needs 50 streams for full utilization; pipelined
- Loop 4 in foo, in region 1, in parallel phase 2: dynamically scheduled, variable chunks, min size = 8, compiler generated
- Loop 5 in foo at line 10, in loop 4: 2 loads, 1 store, 2 floating-point operations, 3 instructions; needs 44 streams for full utilization; pipelined

Parallel Histogram

Thread-centric: Time = N_elements

    PARALLEL-DO i = 0 .. Nelements-1
        j = 0
        while (j < Nbins && elem[i] < binmax[j]) j++
        BEGIN CRITICAL-REGION          // only 1 thread at a time
            counts[j]++
        END CRITICAL-REGION

- Critical region around the update to the counts array
- The serial bottleneck in the critical region wastes potential concurrency

Data-centric: Time = N_elements / N_bins

    PARALLEL-DO i = 0 .. Nelements-1
        j = 0
        while (j < Nbins && elem[i] < binmax[j]) j++
        INT_FETCH_ADD(counts[j], 1)    // updates are atomic

- Updates to the count table are atomic; this requires abundant fine-grained synchronization
- All concurrency can be exploited: the maximum concurrency is limited only by the number of bins, and every bin can be updated simultaneously
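
One possible OpenMP rendering of the two pseudocode variants (an editorial sketch, not the authors' code). Nbins, binmax[], elem[], and counts[] are assumed to be set up by the caller, and the bin search is clamped to Nbins-1 so the update stays in bounds.

    /* Thread-centric: the critical construct serializes every update. */
    void histogram_thread_centric(int Nelements, int Nbins,
                                  const double *elem, const double *binmax,
                                  long *counts) {
        #pragma omp parallel for
        for (int i = 0; i < Nelements; i++) {
            int j = 0;
            while (j < Nbins - 1 && elem[i] < binmax[j]) j++;  /* clamped at last bin */
            #pragma omp critical          /* only one thread updates any bin at a time */
            counts[j]++;
        }
    }

    /* Data-centric: fine-grained atomic updates, so different bins can be
     * updated concurrently (the OpenMP analogue of INT_FETCH_ADD). */
    void histogram_data_centric(int Nelements, int Nbins,
                                const double *elem, const double *binmax,
                                long *counts) {
        #pragma omp parallel for
        for (int i = 0; i < Nelements; i++) {
            int j = 0;
            while (j < Nbins - 1 && elem[i] < binmax[j]) j++;
            #pragma omp atomic
            counts[j]++;
        }
    }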

Linked List Manipulation

Insertion sort:
- Time = N*N/2 when inserting from the same side every time
- Time = N*N/4 when inserting at the head or tail, whichever is nearer

Concurrency:
- Unlimited concurrency during the search
- Concurrency during manipulations:
  - Thread parallelism: one update to the list at a time
  - Data parallelism: insert between each pair of elements, growing the list length by 50% on every step
- Two-phase insertion: search, then lock and modify

Two-Phase List Insertion (OpenMP)

1. Find the site to insert at:
   - Do not lock the list
   - Traverse the list serially
2. Perform the list update:
   - Enter the critical region
   - Re-confirm the site is unchanged
   - Update the list pointers
   - End the critical region

- Only one insertion at a time; unlimited number of search threads
- More threads means more failed inter-node locks and more repeated (wasted) searches

Wallclock time = N: searches run in parallel, but the N updates are serialized.
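
A minimal OpenMP sketch of this two-phase idiom, assuming a sorted singly linked list with a sentinel head node; names are illustrative, and the unlocked search is shown as in the slide rather than hardened against concurrent removals.

    #include <stdlib.h>

    typedef struct node {
        int key;
        struct node *next;
    } node_t;

    /* Phase 1 searches without locks; phase 2 serializes the whole update
     * in one named critical region and re-confirms the site, retrying the
     * search if another insertion got there first. */
    void insert_sorted(node_t *head, int key) {
        for (;;) {
            node_t *prev = head;                 /* phase 1: unlocked search */
            while (prev->next && prev->next->key < key)
                prev = prev->next;
            node_t *succ = prev->next;

            int done = 0;
            #pragma omp critical(list_update)    /* phase 2: one insertion at a time */
            {
                if (prev->next == succ) {        /* site unchanged? */
                    node_t *n = malloc(sizeof *n);
                    n->key = key;
                    n->next = succ;
                    prev->next = n;
                    done = 1;
                }                                /* else: wasted search, retry */
            }
            if (done) break;
        }
    }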

Two-Phase List Insertion (MTA)

1. Find the site to insert at:
   - Do not lock the list
   - Traverse the list serially
2. Perform the list update:
   - Lock the two elements being inserted between
   - Acquire the locks in lexicographical order to avoid deadlock
   - Confirm the link between the nodes is unchanged
   - Update the link pointers of the nodes
   - Unlock the two elements in the reverse order of locking

- N/4 maximum concurrent insertions; insert between every pair of nodes
- More insertion points means fewer failed inter-node locks

Total time = log N
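
The MTA/XMT version relies on full/empty tag bits, which standard C does not have, so the sketch below approximates the idea with a per-node pthread mutex and address-ordered locking; it illustrates the protocol, not the actual XMT mechanism, and the names are illustrative.

    #include <pthread.h>
    #include <stdlib.h>

    typedef struct lnode {
        int key;
        struct lnode *next;
        pthread_mutex_t lock;        /* stand-in for the MTA's full/empty bit */
    } lnode_t;

    /* Lock only the two nodes being inserted between, always in a fixed
     * (address) order to avoid deadlock; re-confirm the link, splice,
     * then unlock in reverse order. */
    void insert_sorted_fine(lnode_t *head, int key) {
        for (;;) {
            lnode_t *prev = head;                /* phase 1: unlocked search */
            while (prev->next && prev->next->key < key)
                prev = prev->next;
            lnode_t *succ = prev->next;

            lnode_t *first = prev, *second = succ;   /* fixed lock order */
            if (second && second < first) { first = succ; second = prev; }
            pthread_mutex_lock(&first->lock);
            if (second) pthread_mutex_lock(&second->lock);

            int done = 0;
            if (prev->next == succ) {            /* link still intact? */
                lnode_t *n = malloc(sizeof *n);
                n->key = key;
                n->next = succ;
                pthread_mutex_init(&n->lock, NULL);
                prev->next = n;
                done = 1;
            }

            if (second) pthread_mutex_unlock(&second->lock);
            pthread_mutex_unlock(&first->lock);  /* reverse order of locking */
            if (done) return;                    /* else retry the search */
        }
    }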

Global HACHAR: Initial Data Structure

[Diagram: region head and non-full region chain pointing at table chunks in Region 0 and Region 1; chain lengths initialized to 0, and each region holding its own next-free-slot counter set to 0.]

- Use a two-step acquire on the length, the region linked-list pointers, and the chain pointers
- Use int_fetch_add on the next-free-slot counter to allocate a list node

Global HACHAR: Two Items Inserted

[Diagram: the hash-function range maps onto the table chunks; two chains now have length 1, and each list node holds a word and word ID; each region keeps its own next-free-slot counter.]

- Lock the length and insert into the head of the list
- Potential contention only on the length
- The list node shows an example for Bag of Words

Global HACHAR: Collisions

[Diagram: a chain of length 3 whose nodes, each holding a word and word ID, span Region 0 and Region 1.]

- Lookup: walk the chain, no locking
- Malloc and free are limited to the few region buffers
- Growing a chain requires locking only the last pointer (int_fetch_add on the length); see the allocation sketch after the region-overflow slide below

Global HACHAR: Region Overflow

[Diagram: the table after region overflow, with chain lengths 1, 3, and 2; a third region (Region 2), with its own next-free-slot counter, has been added to the non-full region chain alongside Region 0 and Region 1.]
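
A rough editorial sketch, with assumed names and sizes, of the slot-allocation idiom running through the HACHAR slides: a per-region next-free-slot counter advanced with fetch-and-add (standing in for the XMT's int_fetch_add). The chain and region-overflow bookkeeping are not reproduced.

    #include <stdatomic.h>

    #define REGION_SLOTS 1024        /* assumed chunk size */

    typedef struct hnode {
        const char   *word;          /* Bag-of-Words example payload */
        long          word_id;
        struct hnode *next;
    } hnode_t;

    typedef struct region {
        hnode_t     slots[REGION_SLOTS];
        atomic_int  next_free_slot;  /* allocated with fetch-and-add */
    } region_t;

    /* Returns NULL when the region is full; the caller would then move to
     * (or allocate) the next region, as in the Region Overflow slide. */
    hnode_t *alloc_node(region_t *r) {
        int slot = atomic_fetch_add(&r->next_free_slot, 1);
        return (slot < REGION_SLOTS) ? &r->slots[slot] : NULL;
    }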

Once Per Thread vs. Once Per Iteration

Allocating storage once per thread:

    PARALLEL-DO i = 0 .. Nthreads-1                          // parallel loop, one iteration per thread
        float *p = malloc(...)                               // malloc hoisted out of the inner loop, or fused into the outer loop
        int n_iters = Nelements / Nthreads
        DO j = i*n_iters .. min(Nelements, (i+1)*n_iters)    // block of serial iterations
            p[j] = ... x[j] ...
            x[j] = ... p[j] ...
        ENDDO
        free(p)
    END-PARALLEL-DO

OpenMP:
- Parallel regions are separate from loops
- The loop-decomposition idiom is already captured in the conventional pragma syntax

XMT: an abominable kludge
- Non-portable pragma semantics
- Requires separate loops, and possibly a parallel region
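
For comparison, a hedged sketch of how the same once-per-thread allocation is usually written in OpenMP: the scratch buffer is allocated in the parallel region and the element loop is decomposed by the worksharing pragma. Nelements, x[], and the per-element work are placeholders.

    #include <stdlib.h>

    void per_thread_scratch(int Nelements, double *x) {
        #pragma omp parallel
        {
            double *p = malloc(Nelements * sizeof *p);   /* once per thread */

            #pragma omp for                              /* loop decomposed among threads */
            for (int j = 0; j < Nelements; j++) {
                p[j] = 2.0 * x[j];     /* stand-in for  p[j] = ... x[j] ... */
                x[j] = p[j] + 1.0;     /* stand-in for  x[j] = ... p[j] ... */
            }

            free(p);                                     /* once per thread */
        }
    }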

Synchronization Is a Natural Part of Parallelism

Synchronization cannot be avoided; it must be made efficient and easy to use.

Lock-free algorithms aren't...
- Sequential execution of atomic ops (compare-and-swap, fetch-and-add)
- Hidden lock semantics in the compiler or hardware (change either and you have a bug)
- Communication-free loop-level data parallelism (i.e., vectorization) is a limited kind of parallelism

Synchronization is needed for many purposes:
- Atomicity
- Enforcing order dependencies
- Managing threads

It must be abundant:
- You should not have to worry about allocating or rationing synchronization variables

"I'm a lock-free scalable parallel algorithm!"

Conclusions

- Synchronization is a natural part of parallel programs; it must be abundant, efficient, and easy to use
- Some latencies can be minimized; others must be tolerated
  - Parallelism can mask latency; enough parallelism makes latency irrelevant
- Fine-grained synchronization improves utilization
  - More opportunities for exploiting parallelism
  - Proactively share resources
- Parallelism is performance portable
  - Same programming model from the desktop to the supercomputer
  - Quality of service is determined by the amount of concurrency in the hardware, not by optimizations in the software
- AMOs versus tag bits
  - Tag bits make fine-grained parallelism possible: more data means more concurrency
  - AMOs do not scale
