Chapter 12: Multiprocessor Architectures. Lesson 01: Performance characteristics of Multiprocessor Architectures and Speedup




Objective: Be familiar with basic multiprocessor architectures, and be able to discuss speedup in multiprocessor systems, including common causes of less-than-linear and super-linear speedup.

Basic multiprocessor architectures

Multiprocessors: A set of processors connected by a communications network.

Basic multiprocessor architecture

Early multiprocessors: Often used processors that had been specifically designed for use in a multiprocessor, to improve efficiency.

Most current multiprocessors: Use the same processors found in contemporary uniprocessor systems, taking advantage of the large sales volumes of uniprocessors to lower prices.

Most current multiprocessors: As the number of transistors that can be placed on a chip increases, features to support multiprocessor systems are integrated into processors intended primarily for uniprocessors, allowing more efficient multiprocessor systems to be built around these uniprocessors.

Performance characteristics of multiprocessors

Multiprocessor speedup: Just as designers of uniprocessor systems measure performance in terms of speedup, multiprocessor architects measure performance in terms of speedup.

Multiprocessor speedup: Speedup generally refers to how much faster a program runs on a system with n processors than on a system with one processor of the same type.
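This definition can be written as a one-line formula, speedup = T(1) / T(n). The runtimes below are hypothetical values chosen only to illustrate it:

```python
# Speedup: ratio of single-processor runtime to n-processor runtime.
def speedup(t_one: float, t_n: float) -> float:
    """Return how much faster the n-processor run is."""
    return t_one / t_n

t_uni = 120.0    # seconds on one processor (hypothetical)
t_multi = 40.0   # seconds on four processors (hypothetical)
s = speedup(t_uni, t_multi)
print(s)  # 3.0 -- below the ideal (linear) speedup of 4
```

A speedup of 3.0 on four processors is typical of the less-than-linear behavior discussed below.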

Loop unrolling: Transforming a loop with N iterations into a loop with N/M iterations, where each iteration of the new loop does the work of M iterations of the old loop.
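As a small sketch of this transformation, the loop below sums a list of 8 elements (N = 8), first directly and then unrolled by a factor of M = 4, so the unrolled loop runs only N/M = 2 iterations:

```python
data = [1, 2, 3, 4, 5, 6, 7, 8]

# Original loop: N = 8 iterations, one addition each.
total = 0
for i in range(8):
    total += data[i]

# Unrolled by M = 4: N/M = 2 iterations, four additions each.
total_unrolled = 0
for i in range(0, 8, 4):
    total_unrolled += data[i]
    total_unrolled += data[i + 1]
    total_unrolled += data[i + 2]
    total_unrolled += data[i + 3]

print(total, total_unrolled)  # 36 36 -- same result, fewer iterations
```

Both loops compute the same sum; the unrolled version simply pays the loop overhead (index update, bounds test) one quarter as often.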

Speedup on increasing the number of processors

Ideal and practical speedups of multiprocessor systems

Ideal speedup in a multiprocessor system: Linear speedup, where the execution time of a program on an n-processor system is 1/nth of its execution time on a one-processor system.

Practical speedup in a multiprocessor system: When the number of processors is small, the system achieves near-linear speedup. As the number of processors increases, the speedup curve diverges from the ideal, eventually flattening out or even decreasing.
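One simple way to see why the curve can flatten and then turn down is to model overhead that grows with the processor count. The model below is hypothetical (compute time shrinks as T(1)/n while communication cost grows linearly with n); it is not a measured curve, just an illustration of the shape:

```python
# Hypothetical speedup model: compute time T1/n plus communication
# overhead that grows with the number of processors n.
def modeled_speedup(n: int, t_one: float = 1000.0,
                    comm_per_proc: float = 1.0) -> float:
    t_n = t_one / n + comm_per_proc * n   # compute + communication
    return t_one / t_n

for n in (1, 4, 16, 32, 64):
    print(n, round(modeled_speedup(n), 1))
# 1 1.0 / 4 3.9 / 16 12.7 / 32 15.8 / 64 12.6
# Near-linear at first, then the curve flattens and finally declines.
```

With these (made-up) constants the speedup peaks around 32 processors; adding more processors makes the program slower, matching the "eventually flattening out or even decreasing" behavior described above.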

Limitations on speedup of multiprocessor systems

Limitations: Interprocessor communication, synchronization, and load balancing.

Limitations of interprocessor communication: Whenever one processor computes a value that is needed by the fraction of the program running on another processor, that value must be communicated to the processors that need it, which takes time. On a uniprocessor system, the entire program runs on one processor, so no time is lost to interprocessor communication.

Limitations of synchronization: It is often necessary to synchronize the processors to ensure that they have all completed some phase of the program before any processor begins working on the next phase.

Example of synchronization in programs on multiprocessor systems

Example: Programs that simulate physical phenomena, such as games, generally divide time into steps of fixed duration and require that all processors have completed their simulation of a given time-step before any processor can proceed to the next.

Example (continued): This way, the simulation of the next time-step can be based on the results of the simulation of the current time-step. This synchronization itself requires interprocessor communication, introducing overhead that is not found in uniprocessor systems.
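This time-step discipline is commonly implemented with a barrier: no thread may start step k+1 until every thread has finished step k. A minimal sketch using Python's `threading.Barrier`, with a hypothetical simulation reduced to logging (region, step) pairs:

```python
import threading

NUM_WORKERS = 4   # one thread per simulated region (hypothetical split)
NUM_STEPS = 3
barrier = threading.Barrier(NUM_WORKERS)
log = []          # list.append is atomic in CPython

def worker(region: int) -> None:
    for step in range(NUM_STEPS):
        log.append((step, region))   # "simulate" this region's time-step
        barrier.wait()               # block until every region finishes step

threads = [threading.Thread(target=worker, args=(r,))
           for r in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(log))  # 12 entries: 4 regions x 3 steps, grouped by step
```

Because every thread logs its step before calling `barrier.wait()`, all step-0 entries appear in `log` before any step-1 entry; the barrier call itself is exactly the synchronization overhead the slide describes.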

Effect of the greater amount of time required to communicate between processors: It lowers the speedup that programs running on the system achieve, because it affects the amount of time required to communicate data between the parts of the program running on each processor.

Effect of the greater amount of time required to communicate between processors: It also affects the amount of time required for synchronization, since synchronization is typically implemented as a sequence of interprocessor communications.

Load balancing on multiprocessor systems

Load balancing: In many parallel applications, it is difficult to divide the program evenly across the processors. When keeping every processor working for the same amount of time is not possible, some processors complete their tasks early and are then idle, waiting for the others to finish.

Algorithms that dynamically balance an application's load: Systems with low interprocessor communication latencies can take advantage of these algorithms by moving work from processors that are taking longer to complete their part of the program to other processors, so that no processor is ever idle.

Algorithms that dynamically balance an application's load: Systems with longer communication latencies benefit less from these algorithms, because communication latency determines how long it takes to move a unit of work from one processor to another.
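One common form of dynamic load balancing is a shared work queue: instead of assigning tasks to processors up front, each worker pulls the next task when it becomes free, so fast workers naturally take more tasks. A minimal sketch with two worker threads and hypothetical, deliberately uneven task costs:

```python
import queue
import threading

tasks = queue.Queue()
for cost in [5, 1, 1, 1, 8, 1, 1, 2]:   # uneven task sizes (hypothetical)
    tasks.put(cost)

done = {"w0": 0, "w1": 0}               # work completed per worker
lock = threading.Lock()

def worker(name: str) -> None:
    while True:
        try:
            cost = tasks.get_nowait()   # pull the next unit of work
        except queue.Empty:
            return                      # no work left: stop
        with lock:
            done[name] += cost

threads = [threading.Thread(target=worker, args=(name,)) for name in done]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sum(done.values()))  # 20: all work completed, split on demand
```

How the 20 units split between the two workers varies from run to run, but no worker sits idle while tasks remain; with a static 4-tasks-each split, one worker could finish long before the other.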

Super-linear speedup in multiprocessor systems

Superlinear speedups: Achieving speedup greater than n on an n-processor system. This requires each of the processors in an n-processor multiprocessor to complete its fraction of the program in less than 1/nth of the program's execution time on a uniprocessor.

Effect of increased cache size: When the data required by the portion of the program that runs on each processor can fit in that processor's cache, the average memory latency is reduced, improving performance.

Better structure: Some programs perform less work when executed on a multiprocessor than when executed on a uniprocessor, and can therefore achieve superlinear speedups.

Example of better structure on a multiprocessor system: Consider a program that searches for the best answer to a problem by examining all of the possibilities. Such programs sometimes exhibit superlinear speedup because the multiprocessor version examines the possibilities in a different order, one that allows it to rule out more possibilities without examining them.

Example of better structure on a multiprocessor system (continued): Because the multiprocessor version examines fewer total possibilities than the uniprocessor program, it can complete the search with greater-than-linear speedup.
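The order effect can be shown with a toy pruned search (all values hypothetical): candidates are grouped, each group carries an upper bound, and a group is skipped entirely if its bound cannot beat the best answer found so far. Visiting the strong group early, as a multiprocessor partition might, prunes far more work:

```python
def count_examined(groups, order):
    """Count candidates examined under a given visiting order,
    skipping any group whose upper bound can't beat the best so far."""
    best = float("-inf")
    examined = 0
    for i in order:
        bound, candidates = groups[i]
        if bound <= best:
            continue                 # pruned: nothing in it is examined
        for c in candidates:
            examined += 1
            best = max(best, c)
    return examined

# Hypothetical search space: (upper bound, candidates) per group.
groups = [(3, [1, 2, 3]), (5, [4, 5, 2]), (9, [9, 1, 1]), (4, [4, 2, 2])]

seq = count_examined(groups, [0, 1, 2, 3])  # uniprocessor left-to-right order
par = count_examined(groups, [2, 0, 1, 3])  # order a parallel split might visit
print(seq, par)  # 9 3 -- same answer found, far fewer candidates examined
```

Both orders find the same best answer (9), but the second examines only 3 candidates instead of 9; dividing that saved work across processors is what makes the measured speedup exceed n.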

Example of better structure on a multiprocessor system (continued): Note that rewriting the uniprocessor program to examine the possibilities in the same order as the multiprocessor program would improve its performance, bringing the speedup back down to linear or less.

Summary

We learnt: How multiprocessors extract parallelism. Speedup is generally near-linear for small numbers of processors and falls below linear as more processors are added. The limitations on achieving linear speedup are interprocessor communication, synchronization, and load balancing.

We learnt: Super-linear performance can occur when the data needed by the portion of the program running on each processor fits entirely in that processor's cache.

End of Lesson 01 on Performance characteristics of Multiprocessor Architectures and Speedup