BLM 413E - Parallel Programming Lecture 3

Similar documents
Introduction to Cloud Computing

Virtual Machines.

HPC Wales Skills Academy Course Catalogue 2015

Client/Server Computing Distributed Processing, Client/Server, and Clusters

Introduction to Virtual Machines

Design and Implementation of the Heterogeneous Multikernel Operating System

GPU System Architecture. Alan Gray EPCC The University of Edinburgh

High Performance Computing

Multi-core Programming System Overview

GPUs for Scientific Computing

Kernel Types System Calls. Operating Systems. Autumn 2013 CS4023

Chapter 6, The Operating System Machine Level

IBM Software Group. Lotus Domino 6.5 Server Enablement

Introduction to GPU hardware and to CUDA

Last Class: OS and Computer Architecture. Last Class: OS and Computer Architecture

Parallel Algorithm Engineering

Parallel Programming Survey

Introducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child

Spring 2011 Prof. Hyesoon Kim

Tools Page 1 of 13 ON PROGRAM TRANSLATION. A priori, we have two translation mechanisms available:

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming

Objectives. Chapter 2: Operating-System Structures. Operating System Services (Cont.) Operating System Services. Operating System Services (Cont.

OpenMP & MPI CISC 879. Tristan Vanderbruggen & John Cavazos Dept of Computer & Information Sciences University of Delaware

<Insert Picture Here> Oracle Database Support for Server Virtualization Updated December 7, 2009

The Fastest Way to Parallel Programming for Multicore, Clusters, Supercomputers and the Cloud.

Operating System Compiler Bits Part Number CNL 6.0 AMD Opteron (x86-64) Windows XP x64 Intel C Microsoft Platform SDK 64 P10312

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi

Multi-core architectures. Jernej Barbic , Spring 2007 May 3, 2007

Multi-Threading Performance on Commodity Multi-Core Processors

Operating Systems for Parallel Processing Assistent Lecturer Alecu Felician Economic Informatics Department Academy of Economic Studies Bucharest

Course Development of Programming for General-Purpose Multicore Processors

Real-time processing the basis for PC Control

The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System

GPU Hardware and Programming Models. Jeremy Appleyard, September 2015

CUDA programming on NVIDIA GPUs

Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga

Release Notes for Open Grid Scheduler/Grid Engine. Version: Grid Engine

Centralized Systems. A Centralized Computer System. Chapter 18: Database System Architectures

x64 Servers: Do you want 64 or 32 bit apps with that server?

A Pattern-Based Approach to. Automated Application Performance Analysis

How To Visualize Performance Data In A Computer Program

COM 444 Cloud Computing

Parallel Computing with MATLAB

Virtualization and the U2 Databases

Components of a Computing System. What is an Operating System? Resources. Abstract Resources. Goals of an OS. System Software

Using the Windows Cluster

Sourcery Overview & Virtual Machine Installation

ultra fast SOM using CUDA

Parallel Computing. Parallel shared memory computing with OpenMP

Scalability evaluation of barrier algorithms for OpenMP

A Brief Survery of Linux Performance Engineering. Philip J. Mucci University of Tennessee, Knoxville

COMP5426 Parallel and Distributed Computing. Distributed Systems: Client/Server and Clusters

Software: Systems and Application Software

Chapter 2 Parallel Architecture, Software And Performance

White Paper on NETWORK VIRTUALIZATION

Multilevel Load Balancing in NUMA Computers

End-user Tools for Application Performance Analysis Using Hardware Counters

DB2 Connect for NT and the Microsoft Windows NT Load Balancing Service

VMware Server 2.0 Essentials. Virtualization Deployment and Management

Introduction to GPU Programming Languages

Middleware. Peter Marwedel TU Dortmund, Informatik 12 Germany. technische universität dortmund. fakultät für informatik informatik 12

Making Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association

An examination of the dual-core capability of the new HP xw4300 Workstation

L20: GPU Architecture and Models


İSTANBUL AYDIN UNIVERSITY

HPC Software Requirements to Support an HPC Cluster Supercomputer

System requirements for MuseumPlus and emuseumplus

Full and Para Virtualization

Building Applications Using Micro Focus COBOL

DISTRIBUTED AND PARALLELL DATABASE

Last Class: OS and Computer Architecture. Last Class: OS and Computer Architecture

Bringing Big Data Modelling into the Hands of Domain Experts

Incorporating Multicore Programming in Bachelor of Science in Computer Engineering Program

Building Reliable, Scalable AR System Solutions. High-Availability. White Paper

- An Essential Building Block for Stable and Reliable Compute Clusters

Server and Storage Virtualization

Virtualization Technologies and Blackboard: The Future of Blackboard Software on Multi-Core Technologies

Scientific Computing Programming with Parallel Objects

Chapter 16 Distributed Processing, Client/Server, and Clusters

An Analysis of Container-based Platforms for NFV

Decomposition into Parts. Software Engineering, Lecture 4. Data and Function Cohesion. Allocation of Functions and Data. Component Interfaces

Eloquence Training What s new in Eloquence B.08.00

Introduction to grid technologies, parallel and cloud computing. Alaa Osama Allam Saida Saad Mohamed Mohamed Ibrahim Gaber

OpenACC 2.0 and the PGI Accelerator Compilers

Virtualization. Michael Tsai 2015/06/08

Linux tools for debugging and profiling MPI codes

Mobile Devices and Systems Lesson 02 Handheld Pocket Computers and Mobile System Operating Systems

Weighted Total Mark. Weighted Exam Mark

Oracle9i Release 2 Database Architecture on Windows. An Oracle Technical White Paper April 2003

Section 1.4. Java s Magic: Bytecode, Java Virtual Machine, JIT,

Virtualization for Cloud Computing

Transcription:

BLM 413E - Parallel Programming Lecture 3 FSMVU Bilgisayar Mühendisliği Öğr. Gör. Musa AYDIN 14.10.2015 2015-2016 M.A. 1

Parallel Programming Models

Parallel Programming Models Overview There are several parallel programming models in common use: Shared Memory (without threads) Threads Distributed Memory / Message Passing Data Parallel Hybrid Single Program Multiple Data (SPMD) Multiple Program Multiple Data (MPMD)

Parallel Programming Models Overview Parallel programming models are NOT specific to a particular type of machine or memory architecture. In fact, any of these models can (theoretically) be implemented on any underlying hardware. Two examples from the past are discussed below.

Parallel Programming Models SHARED memory model on a DISTRIBUTED memory machine: Kendall Square Research (KSR) ALLCACHE approach. Machine memory was physically distributed across networked machines, but appeared to the user as a single shared memory (global address space). Generically, this approach is referred to as "virtual shared memory".

Parallel Programming Models DISTRIBUTED memory model on a SHARED memory machine: Message Passing Interface (MPI) on SGI Origin 2000. The SGI Origin 2000 employed the CC-NUMA type of shared memory architecture, where every task has direct access to global address space spread across all machines. However, the ability to send and receive messages using MPI, as is commonly done over a network of distributed memory machines, was implemented and commonly used.

Parallel Programming Models Which model to use? This is often a combination of what is available and personal choice. There is no "best" model, although there certainly are better implementations of some models over others. The following sections describe each of the models mentioned above, and also discuss some of their actual implementations.?

Shared Memory Model (without threads) In this programming model, tasks share a common address space, which they read and write to asynchronously. Various mechanisms such as locks / semaphores may be used to control access to the shared memory. An advantage of this model from the programmer's point of view is that the notion of data "ownership" is lacking, so there is no need to specify explicitly the communication of data between tasks. Program development can often be simplified.

Shared Memory Model (without threads) An important disadvantage in terms of performance is that it becomes more difficult to understand and manage data locality. Keeping data local to the processor that works on it conserves memory accesses, cache refreshes and bus traffic that occurs when multiple processors use the same data. Unfortunately, controlling data locality is hard to understand and beyond the control of the average user.

Shared Memory Model (with threads) In the threads model of parallel programming model is a type of shared memory programming. In this model, a single process can have multiple, concurrent execution paths. The most simple analogy that can be used to describe threads is the concept of a single program that includes a number of subroutines:

Shared Memory Model (with threads) The main program a.out is scheduled to run by the native operating system. a.out loads and acquires all of the necessary system and user resources to run. a.out performs some serial work, and then creates a number of tasks (threads) that can be scheduled and run by the operating system concurrently.

Shared Memory Model (with threads) A thread's work may best be described as a subroutine within the main program. Any thread can execute any subroutine at the same time as other threads. Each thread has local data, but also, shares the entire resources of a.out. This saves the overhead associated with replicating a program's resources for each thread. Each thread also benefits from a global memory view because it shares the memory space of a.out.

Shared Memory Model (with threads) Threads communicate with each other through global memory (updating address locations). This requires synchronization constructs to ensure that more than one thread is not updating the same global address at any time. Threads can come and go, but a.out remains present to provide the necessary shared resources until the application has completed.

Shared Memory Model (with threads) Implementations OpenMP (Open Multiprocessing is an API that supports multiplatform shared memory multiprocessing programming in C, C++ and Fortran on most processor architecture and operating systems, including Solaris, AIX, HP-UX, GNU/Linux, Mac OS X, and Windows platforms. OpenMP is managed by a group of major computer hardware and software vendors, including AMD, IBM, Intel, Cray, HP, Fujitsu, Nvidia, NEC, Microsoft, Texas ınstruments, Oracle Corporation and more.

Shared Memory Model (with threads) Implementations It consists of a set of compiler directives, library routines and environment variables that influence run-time behavior.

Distributed Memory/Message Passing Model This model demonstrates the following characteristics A set of tasks that use their own local memory during computation. Multiple tasks can reside on the same physical machine and/or across an arbitrary number of machines. Tasks exchange data through communications by sending and receiving messages. Data transfer usually requires cooperative operations to be performed by each process. For example, a send operation must have a matching receive operation.

Distributed Memory/Message Passing Model Implementations From a programming perspective, message passing implementations usually comprise a library of subroutines. Calls to these subroutines are embedded in source code. The programmer is responsible for determining all parallelism. Historically, a variety of message passing libraries have been available since the 1980s. These implementations differed substantially from each other making it difficult for programmers to develop portable applications. In 1992, the MPI Forum was formed with the primary goal of establishing a standard interface for message passing implementations.

Distributed Memory/Message Passing Model Implementations: Part 1 of the Message Passing Interface (MPI) was released in 1994. Part 2 (MPI-2) was released in 1996. Both MPI specifications are available on the web at http://www-unix.mcs.anl.gov/mpi/. MPI is now the "de facto" industry standard for message passing, replacing virtually all other message passing implementations used for production work. MPI implementations exist for virtually all popular parallel computing platforms. Not all implementations include everything in both MPI1 and MPI2. More Information: MPI tutorial: computing.llnl.gov/tutorials/mpi

Data Parallel Model Most of the parallel work focuses on performing operations on a data set. The data set is typically organized into a common structure, such as an array or cube. A set of tasks work collectively on the same data structure, however, each task works on a different partition of the same data structure. Tasks perform the same operation on their partition of work, for example, "add 4 to every array element".

Data Parallel Model On shared memory architectures, all tasks may have access to the data structure through global memory. On distributed memory architectures the data structure is split up and resides as "chunks" in the local memory of each task.

Hybrid Model Combines more than one of the previously described programming models. A common example is the combination of the message passing model (MPI) with the threads model (OpenMP). Threads perform computationally intensive kernels using local, on-node data Communications between processes on different nodes occurs over the network using MPI. Another similar and increasingly popular example is using MPI with GPU (Graphics Processing Unit) programming.

SPMD and MPMD Single Program Multiple Data (SPMD) SPMD is actually a high level programming model that can be built upon any combination of the previously mentioned parallel programming models SINGLE PROGRAM: All tasks execute their copy of the same program simultaneously. This program can be; threads, message passing, data parallel or hybrid MULTIPLE DATA: All tasks may use different data

SPMD and MPMD Single Program Multiple Data (SPMD) SPMD programs usually have the necessary logic programmed into them to allow different tasks to branch or conditionally execute only those parts of the program they are designed to execute. That is, tasks do not necessarily have to execute the entire program - perhaps only a portion of it. The SPMD model, using message passing or hybrid programming, is probably the most commonly used parallel programming model for multi-node clusters

SPMD and MPMD Multiple Program Multiple Data (MPMD) MPMD is actually a high level programming model that can be built upon any combination of the previously mentioned parallel programming models MULTIPLE PROGRAM: Tasks may execute different programs simultaneously. The programs can be threads, message passing, data parallel or hybrid

SPMD and MPMD Multiple Program Multiple Data (MPMD) MULTIPLE DATA: All tasks may use different data MPMD applications are not as common as SPMD applications, but may be better suited for certain types of problems, particularly those that lend themselves better to functional decomposition than domain decomposition