Advanced Computer Architecture




Advanced Computer Architecture
Institute for Multimedia and Software Engineering
Conduction of exercises: Institute for Multimedia and Software Engineering, BB 315c, Tel: 379-1174, E-mail: marius.rosu@uni-due.de

Execution Time, Throughput, Speedup

What is better? The question is not precise!

  Aeroplane    NY to Paris   Speed      Passengers   Throughput (persons/h)
  Boeing 747   6.5 h         610 mph    470          72.3
  Concorde     3 h           1350 mph   132          44.0

Execution time T (response time, latency): [sec], [h], ...
Throughput X (bandwidth): [1/sec], [1/h], ...
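To make the definitions concrete, here is a minimal Python sketch, added for illustration only (the flight data come from the table above; the code and variable names are not part of the slides). It computes throughput as passengers per hour and the speedup of the Concorde over the Boeing 747.

```python
# Throughput and speedup for the aeroplane example (table values from the slide).

flights = {
    "Boeing 747": {"time_h": 6.5, "passengers": 470},
    "Concorde":   {"time_h": 3.0, "passengers": 132},
}

# Throughput X = passengers / execution time  [persons/h]
for name, f in flights.items():
    print(f"{name}: throughput = {f['passengers'] / f['time_h']:.1f} persons/h")

# Speedup S = T(B) / T(A): A (Concorde) is S times faster than B (Boeing 747)
speedup = flights["Boeing 747"]["time_h"] / flights["Concorde"]["time_h"]
print(f"Speedup: {speedup:.3f}")   # 2.167
```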

Definition of Speedup

Speedup S (acceleration): A is S times faster than B:

  S = T(B) / T(A) = 6.5 h / 3 h = 2.167

Speedup is a measure for judging the processing of a single task (one passenger). Throughput is a measure for judging the processing of the whole workload (with which aeroplane type can an airline transport more passengers?).

Amdahl's Law

In 1967, Gene Amdahl (developer of the IBM 360/xx computer) defined the performance increase of a program with fixed problem size under parallel processing as:

  Speedup S(p) = T_s / (f * T_s + (1 - f) * T_s / p)    (sequential execution time divided by the seq. + parallel execution time)

with
  T_s : execution time for sequential processing of the whole task
  f   : fraction of the execution time in program segments which cannot run in parallel (f = 0..1)
  p   : number of parallel processing elements (processors)

For p -> infinity: S(p) = 1/f; for f -> 0: S(p) = p.
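A short Python sketch of Amdahl's law (an added illustration, not from the slides); note that T_s cancels out, so the speedup depends only on the sequential fraction f and the processor count p.

```python
def amdahl_speedup(f, p):
    """Amdahl's law: S(p) = T_s / (f*T_s + (1-f)*T_s/p) = 1 / (f + (1-f)/p)."""
    return 1.0 / (f + (1.0 - f) / p)

# Even a small sequential fraction limits the achievable speedup:
for p in (2, 8, 64, 1024):
    print(p, round(amdahl_speedup(0.05, p), 2))

# For p -> infinity the speedup approaches 1/f = 20 (here f = 0.05);
# for f -> 0 it approaches p.
```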

Acceleration of Programs

For efficient parallel processing it is necessary to achieve speedups that are close to the number of processors used.

[Figure: speedup S(p) versus number of processors p; S(p) = p (ideal), S(p) < p (real)]

Definition of Efficiency

Efficiency is the ratio of speedup to the number of processors used; it indicates which share of the processor performance can actually be utilized:

  E(p) = S(p) / p = T(sequential) / (p * T(parallel)),   with 0 < E(p) <= 1

[Figure: efficiency versus number of processors p]
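Continuing the illustrative Python sketch (again not part of the slides), efficiency can be combined directly with Amdahl's law to show how quickly the usable share of processor performance drops when the sequential fraction is fixed:

```python
def efficiency(speedup, p):
    """Efficiency E(p) = S(p) / p, with 0 < E(p) <= 1."""
    return speedup / p

# With a sequential fraction of 5%, the efficiency drops quickly as p grows:
for p in (2, 8, 64, 1024):
    s = 1.0 / (0.05 + 0.95 / p)          # Amdahl speedup for f = 0.05
    print(p, round(efficiency(s, p), 3))
```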

Example: Dot Product

x, y are vectors with n elements. Distribute x and y to p processors. The partial dot products are computed locally by each processor; the global dot product is then computed with a reduction algorithm by summation of the partial results.

Example: Dot Product (continued)

Reduction algorithm for p = 2^3 = 8: the reduction requires d = log2(p) steps.

[Figure: reduction tree over the processor index k and the step index i]

In step i (i = 0, ..., d-1), processor k + 2^i sends its local partial result β_(k+2^i)^(i) to processor k, and processor k then computes

  β_k^(i+1) = β_k^(i) + β_(k+2^i)^(i)

with k = 0, 2^(i+1), 2 * 2^(i+1), ... (stride 2^(i+1), k < p).
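The following Python sketch simulates this reduction on p = 8 "processors" (an added illustration with made-up partial results, not part of the slides); each list entry plays the role of one processor's β, and after d = log2(p) steps entry 0 holds the global dot product.

```python
import math

def tree_reduce(beta):
    """Simulate the reduction: in step i, 'processor' k + 2**i sends its value
    to processor k (k a multiple of 2**(i+1)), which adds it to its own."""
    beta = list(beta)
    p = len(beta)                       # number of processors, assumed a power of two
    d = int(math.log2(p))
    for i in range(d):                  # steps i = 0, ..., d-1
        for k in range(0, p, 2 ** (i + 1)):
            beta[k] = beta[k] + beta[k + 2 ** i]
    return beta[0]                      # processor 0 holds the global result

# Made-up partial dot products of p = 8 processors:
partials = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
print(tree_reduce(partials), sum(partials))   # both print 36.0
```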

Example: Dot Product (continued)

[Slide formulas for p = 2^d: T(seq), the parallel time including the word-transfer time t_word per reduction step, and the speedup for n >> p; figure: speedup versus problem size of the algorithm.]

Example: Dot Product - Efficiency

E(p) = S(p) / p. The efficiency increases with growing problem size n (for fixed p) and decreases with a growing number of processors p (for fixed n). How must the problem size n grow with increasing p if the efficiency is to remain constant? For this algorithm, the efficiency remains constant if n grows with p * log2(p).
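A hedged Python sketch of this isoefficiency statement. The cost model below (one time unit per local multiply-add, t_word time units per reduction step) is an assumption made for illustration and not the exact model from the slides; with it, choosing n proportional to p * log2(p) keeps the efficiency constant.

```python
import math

T_CALC = 1.0    # assumed cost of one local multiply-add (model parameter, not from the slides)
T_WORD = 10.0   # assumed cost of one reduction/communication step (model parameter, not from the slides)

def dot_efficiency(n, p):
    """E(p) = T(seq) / (p * T(par)) for the distributed dot product under this simple model."""
    t_seq = n * T_CALC
    t_par = (n / p) * T_CALC + math.log2(p) * T_WORD   # local work + log2(p) reduction steps
    return t_seq / (p * t_par)

c = 100  # grow the problem size as n = c * p * log2(p)
for p in (2, 8, 64, 1024):
    n = c * p * math.log2(p)
    print(p, round(dot_efficiency(n, p), 3))   # constant: 100 / 110 ~ 0.909 for every p
```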

Scalability

A computer architecture or a program is scalable if the efficiency of program processing remains constant for an increasing number of processors. In general, this is only possible with a simultaneous increase of the problem size. A program (an algorithm) is perfectly scalable if a linear increase of n is sufficient, for a linear increase of p, to achieve constant efficiency.

Speedup is usually reduced by additional parallel overhead:

  V(p) = p * T(p) - T(seq)

Scalability (continued)

Definition of the mean parallel overhead (overhead per processor):

  V̄(p) = V(p) / p

Causes of overhead:
- startup costs of an event (process or communication start)
- costs for the distribution/administration of shared data
- costs for synchronization

What is better?
- less communication through bigger work packages for fewer processors (from fine to coarse granularity), or
- smaller work packages distributed to more processors?
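As a final illustration (an added Python sketch with assumed timing numbers, not taken from the slides), the overhead definition turns this trade-off into something measurable:

```python
def parallel_overhead(t_seq, t_par, p):
    """Total parallel overhead V(p) = p * T(p) - T(seq)."""
    return p * t_par - t_seq

# Assumed (illustrative) measurements: 100 s sequentially, 15 s on 8 processors.
t_seq, t_par, p = 100.0, 15.0, 8
print("speedup    =", t_seq / t_par)                           # 6.67
print("efficiency =", t_seq / (p * t_par))                     # ~0.83
print("V(p)       =", parallel_overhead(t_seq, t_par, p))      # 20.0 s of extra work in total
print("V(p)/p     =", parallel_overhead(t_seq, t_par, p) / p)  # 2.5 s mean overhead per processor
```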